An Evaluation of Different Clustering Methods and Distance Measures Used for Grouping Metabolic Pathways

S. M. Kim, M. I. Peña, M. Moll, G. Giannakopoulos, G. N. Bennett, and L. E. Kavraki, “An Evaluation of Different Clustering Methods and Distance Measures Used for Grouping Metabolic Pathways,” in 2016 International Conference on Bioinformatics and Computational Biology. ISCA, 2016, pp. 115–122.

Abstract

Large-scale annotated metabolic databases, such as KEGG and MetaCyc, provide a wealth of information to researchers designing novel biosynthetic pathways. However, many metabolic pathfinding tools that assist in identifying possible solution pathways fail to facilitate the grouping and interpretation of these pathway results. Clustering possible solution pathways can help users of pathfinding tools quickly identify major patterns and unique pathways without having to sift through individual results one by one. In this paper, we assess the ability of three clustering methods (hierarchical, k-means, and k-medoids) combined with three pairwise distance measures (Levenshtein, Jaccard, and n-gram) to expertly group lysine, isoleucine, and 3-hydroxypropanoic acid (3-HP) biosynthesis pathways. The quality of the resulting clusters was quantitatively evaluated against expected pathway groupings taken from the literature. Hierarchical clustering and Levenshtein distance seemed to best match external pathway labels across the three biosynthesis pathways. The lysine biosynthesis pathways, which had the most distinct separation of pathways, had better quality clusters than isoleucine and 3-HP, suggesting that grouping pathways with more complex underlying topologies may require more tailored clustering methods.
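
To make the combination of distance measure and clustering method concrete, the following is a minimal sketch, not the authors' implementation. It assumes each pathway is represented as an ordered list of compound labels, uses made-up labels ("A", "B", ...) rather than real KEGG or MetaCyc identifiers, defines the n-gram distance as a Jaccard distance over sets of n-grams (the paper's exact definition may differ), and uses average linkage for the hierarchical step (also an assumption). The function names are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard distance between the element sets of two sequences."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def ngram_distance(a, b, n=2):
    """Assumed definition: Jaccard distance over the sets of contiguous n-grams."""
    grams_a = {tuple(a[i:i + n]) for i in range(len(a) - n + 1)}
    grams_b = {tuple(b[i:i + n]) for i in range(len(b) - n + 1)}
    return jaccard(grams_a, grams_b)


def hierarchical(items, dist, k):
    """Agglomerative clustering with average linkage, stopped at k clusters."""
    # Precompute all pairwise distances, keyed by (smaller index, larger index).
    d = {(i, j): dist(items[i], items[j])
         for i, j in combinations(range(len(items)), 2)}

    def link(c1, c2):
        total = sum(d[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Toy pathways with hypothetical compound labels (not real metabolites).
pathways = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["A", "X", "Y", "Z"],
    ["A", "X", "Y", "W"],
]

print(hierarchical(pathways, levenshtein, k=2))  # [[0, 1], [2, 3]]
```

Swapping `levenshtein` for `jaccard` or `ngram_distance` in the final call changes only the distance measure, which mirrors the kind of method-by-measure comparison the abstract describes.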

PDF preprint: http://kavrakilab.org/publications/kim-pena2016an-evaluation-of-different-clustering-methods.pdf