Metabolism refers to the chemical reactions, typically catalyzed by enzymes, that occur to sustain life. Experimental studies, coupled with advancements in molecular biology and improved computational methods, have generated increasingly large amounts of data on metabolic reactions and pathways in the last few decades. As a result, many specialized databases, such as KEGG and MetaCyc, have been created to store and organize this metabolic data. The data is usually presented as many small subpathways that are manually divided based on function and sometimes by organism. However, it can be difficult to navigate these subpathways to find connections between compounds, especially when trying to discover novel or non-standard pathways that may span multiple organisms. The metabolic network of even a single species is highly complex, as shown by the depiction of E. coli’s metabolic network to the right. Novel and non-standard pathways can be important for a number of applications, such as metabolic engineering, understanding the metabolic scope of multi-species communities, and metabolic network reconstruction. Combining parts of pathways existing in different organisms can lead to new ways of considering how metabolism works in complex communities and provide novel ways to synthesize important and useful compounds. Computational tools can provide a way to automatically find biologically interesting pathways in metabolic data and reveal pathways that may not have been identified by manual means. The overall goal of this project is to develop algorithms paired with user-friendly software tools to assist and accelerate the discovery and analysis of these novel metabolic pathways. This project is a collaboration with Dr. George Bennett and his lab. A brief description of the algorithms we have developed to find biologically meaningful metabolic pathways can be found below.
The primary problem that computational methods for metabolic pathfinding try to solve is the following: given a start compound and a target compound, find “biologically meaningful” or “realistic” pathways of enzymatic reactions that make the target compound from the start compound. In the last few years, the availability of atom mapping data, that is, atomic-level information on which atoms in the substrate compounds correspond to which atoms in the product compounds, has been steadily increasing. This data can be found in sources such as the KEGG RPAIR database. The algorithms we have developed harness atom mapping data to track atoms through metabolic networks and find pathways that conserve atoms from a given start compound to a given target compound. Atom tracking is a crucial feature for finding meaningful metabolic pathways because it essentially eliminates spurious connections and reactions that do not correspond to useful or real biochemical pathways or reactions. Furthermore, atom tracking enables finding branched metabolic pathways.
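To make the idea of atom tracking concrete, the following is a minimal sketch in Python. It is not the project's implementation. It assumes that each step of a pathway comes with a simple dictionary from substrate atom indices to product atom indices; the function name track_atoms and the toy mappings are hypothetical and are not taken from KEGG RPAIR.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical representation: one atom mapping per pathway step,
# given as {atom index in substrate: atom index in product}.
AtomMapping = Dict[int, int]


def track_atoms(start_atoms: Iterable[int], mappings: List[AtomMapping]) -> Set[int]:
    """Return the atoms of the final compound that originate from start_atoms."""
    tracked = set(start_atoms)
    for mapping in mappings:
        # Atoms that are absent from the mapping leave the pathway at this step.
        tracked = {mapping[atom] for atom in tracked if atom in mapping}
    return tracked


# Toy pathway A -> B -> C (made-up mappings, not real atom mapping data).
a_to_b = {0: 0, 1: 1, 2: 2}  # atom 3 of A is not carried into B
b_to_c = {0: 1, 1: 0}        # atom 2 of B is not carried into C
conserved = track_atoms(range(4), [a_to_b, b_to_c])
print(len(conserved))  # 2 atoms of A reach C

# A spurious connection: the step maps only atom 5 of the substrate,
# so none of the tracked atoms carry over.
spurious = track_atoms(range(4), [{5: 0}])
print(len(spurious))  # 0
```

In this toy setting, a connection that carries none of the start compound's atoms yields an empty set, which is how atom tracking lets a search discard connections that do not actually move material from the start compound toward the target.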
We developed a new algorithm for finding atom-conserving linear pathways in metabolic networks, called LPAT, for Linear Pathfinding with Atom Tracking. LPAT takes as input a start compound, a target compound, the minimum number of atoms to be conserved, the number of pathways to return (k), and a special data structure termed an atom mapping graph. LPAT is then guaranteed to return the k shortest pathways between the start and target compounds that conserve at least the specified minimum number of atoms. LPAT was shown to efficiently and accurately find known linear metabolic pathways in a metabolic network containing thousands of reactions obtained from KEGG.
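The inputs and outputs described above can be illustrated with a small Python sketch. This is not LPAT itself: it swaps in a plain breadth-first enumeration of simple paths, which returns paths in order of increasing length but would not scale to a network with thousands of reactions. The dictionary used as a stand-in for the atom mapping graph, the function name, and the toy data are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for an atom mapping graph:
# compound -> list of (product compound, {substrate atom: product atom}).
AtomMappingGraph = Dict[str, List[Tuple[str, Dict[int, int]]]]


def k_shortest_conserving_paths(
    graph: AtomMappingGraph,
    start: str,
    target: str,
    start_atoms: Iterable[int],
    min_atoms: int,
    k: int,
) -> List[Tuple[List[str], int]]:
    """Return up to k shortest simple paths that conserve >= min_atoms atoms."""
    results: List[Tuple[List[str], int]] = []
    # Breadth-first order means shorter paths are dequeued first.
    queue = deque([([start], frozenset(start_atoms))])
    while queue and len(results) < k:
        path, tracked = queue.popleft()
        if path[-1] == target:
            results.append((path, len(tracked)))
            continue
        for product, mapping in graph.get(path[-1], []):
            if product in path:  # keep paths simple (no repeated compounds)
                continue
            moved = frozenset(mapping[a] for a in tracked if a in mapping)
            # The tracked set can only shrink, so pruning here is safe.
            if len(moved) >= min_atoms:
                queue.append((path + [product], moved))
    return results


# Toy network (made-up data): A -> C directly carries one atom,
# while A -> B -> C carries three.
toy_graph: AtomMappingGraph = {
    "A": [("B", {0: 0, 1: 1, 2: 2}), ("C", {0: 0})],
    "B": [("C", {0: 0, 1: 1, 2: 2})],
}
strict = k_shortest_conserving_paths(toy_graph, "A", "C", range(3), 2, 3)
loose = k_shortest_conserving_paths(toy_graph, "A", "C", range(3), 1, 3)
print(strict)  # [(['A', 'B', 'C'], 3)]
print(loose)   # [(['A', 'C'], 1), (['A', 'B', 'C'], 3)]
```

With a minimum of two conserved atoms, the direct but poorly conserving connection from A to C is discarded and only the longer path through B is returned. Lowering the minimum to one atom brings the shorter path back, ranked first because it uses fewer reactions.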
In addition to eliminating spurious transitions, atom tracking identifies where atoms are lost and gained along a linear pathway and enables finding branched pathways. The first step of BPAT-S, our algorithm for finding branched pathways, is to use LPAT to obtain a set of linear pathways between the desired start and target compounds. BPAT-S then annotates the linear pathways with information about the specific reactions and compounds through which atoms are lost or gained, and stores them. These annotated linear pathways are called seed pathways, and they are indexed for efficient processing and attachment of branches. The branches are identified by calling LPAT to find linear pathways from compounds through which atoms are lost to compounds through which atoms are gained. All valid combinations of these branches are systematically attached to the seed pathways to obtain the resulting set of branched pathways, which are then ranked first by the number of atoms they conserve and then by the total number of reactions they contain. The results from BPAT-S demonstrate that the algorithm can efficiently find and return branched pathways that correspond to known branched pathways. An example of a branched pathway found by BPAT-S between alpha-D-glucose 6-phosphate and L-tryptophan can be found below.
The seventh branched pathway returned by BPAT-S between alpha-D-glucose 6-phosphate and L-tryptophan reveals the interesting topology of the tryptophan biosynthesis pathway. Compound nodes are ovals, mapping nodes are boxes containing the KEGG RPAIR ID and associated EC numbers, and reactions are diamonds containing the KEGG REACTION ID and associated EC numbers. Each edge in the branched pathway corresponds to one molecule. The numbers are to assist in following the pathway, as the figure is split in half for better display.
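Two of the BPAT-S steps described earlier, annotating a seed pathway with the points where atoms are lost and ranking the resulting branched pathways, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the BPAT-S implementation. The names atoms_lost_per_step, BranchedPathway, and rank_pathways are hypothetical, the candidate pathways are made-up values, and the ranking direction (more conserved atoms first, then fewer reactions) is our reading of the description.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


def atoms_lost_per_step(
    start_atoms: Iterable[int], mappings: List[Dict[int, int]]
) -> List[int]:
    """Count how many tracked atoms drop out at each step of a linear pathway.

    Assumes each mapping is one-to-one (substrate atom -> product atom).
    """
    tracked = set(start_atoms)
    lost = []
    for mapping in mappings:
        kept = {mapping[a] for a in tracked if a in mapping}
        lost.append(len(tracked) - len(kept))
        tracked = kept
    return lost


@dataclass(frozen=True)
class BranchedPathway:
    """Hypothetical summary of one candidate pathway."""
    name: str
    atoms_conserved: int
    num_reactions: int


def rank_pathways(pathways: List[BranchedPathway]) -> List[BranchedPathway]:
    # Assumed order: most atoms conserved first, ties broken by fewest reactions.
    return sorted(pathways, key=lambda p: (-p.atoms_conserved, p.num_reactions))


# Steps with a nonzero count mark where a branch could be attached to the seed.
losses = atoms_lost_per_step(range(4), [{0: 0, 1: 1, 2: 2}, {0: 1, 1: 0}])
print(losses)  # [1, 1]

# Made-up candidates, not results reported by BPAT-S.
candidates = [
    BranchedPathway("linear seed", 4, 5),
    BranchedPathway("seed + long branch", 6, 9),
    BranchedPathway("seed + short branch", 6, 7),
]
ranked_names = [p.name for p in rank_pathways(candidates)]
print(ranked_names)
```

Attaching a branch that recovers lost atoms raises the number of atoms conserved, so in this sketch both branched candidates outrank the bare seed pathway, and the one with fewer reactions comes first.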