Kavraki Lab
S. M. Kim, M. I. Peña, M. Moll, G. Giannakopoulos, G. N. Bennett, and L. E. Kavraki, “An Evaluation of Different Clustering Methods and Distance Measures Used for Grouping Metabolic Pathways,” in Eighth International Conference on Bioinformatics and Computational Biology (BICoB), 2016.


Large-scale annotated metabolic databases, such as KEGG and MetaCyc, provide a wealth of information to researchers designing novel biosynthetic pathways. However, many metabolic pathfinding tools that assist in identifying possible solution pathways fail to facilitate the grouping and interpretation of these pathway results. Clustering possible solution pathways can help users of pathfinding tools quickly notice major patterns and unique pathways without having to sift through individual results one by one. In this paper, we assess the ability of three separate clustering methods (hierarchical, k-means, and k-medoids) along with three pair-wise distance measures (Levenshtein, Jaccard, and n-gram) to expertly group lysine, isoleucine, and 3-hydroxypropanoic acid (3-HP) biosynthesis pathways. The quality of the resulting clusters were quantitatively evaluated against expected pathway groupings taken from the literature. Hierarchical clustering and Levenshtein distance seemed to best match external pathway labels across the three biosynthesis pathways. The lysine biosynthesis pathways, which had the most distinct separation of pathways, had better quality clusters than isoleucine and 3-HP, showing that clustering methods may not be capable of grouping pathways with more complex underlying topographies.

PDF preprint: http://kavrakilab.org/publications/kim.pena2016an-evaluation-of-different-clustering-methods.pdf