Broad and extensive knowledge of the biological function of proteins would have immense practical impact on the identification of novel drug targets, the reduction of potential side effects, and on finding the molecular causes of disease. Unfortunately, the experimental determination of protein function is an expensive and time consuming process. In an effort to accelerate and guide the experimental process, computational techniques have been developed to use functional information about well-studied proteins to annotate predictably similar but less-studied proteins.
We are currently investigating the following problems:
We have developed an algorithm called LabelHash to quickly identify substructure matches in a large set of target structures. The algorithm preprocesses the targets to create hash tables that allow for the look up of partial matches in constant time. With a variant of our previous substructure matching algorithm (called MASH) complete matches are computed from these partial matches. Structures are currently represented by C-alpha coordinates labeled with the corresponding residue types. With our nonparametric statistical model we can assign a p-value to each match. The LabelHash algorithm has been made accessible through a web server and can also be downloaded. The algorithm runs in parallel. Typically, computing matches for a 5–10 residue motif in the entire 95% sequence identity filtered non-redundant PDB takes only a couple of minutes on our 8-core web server. The matches can be visualized and further analyzed in Chimera with our ViewMatch plugin. Below, a match (in green) is shown superimposed with a motif (in white), while the rest of the matching protein is shown in ribbon representation.
The Family-wise Analysis of SubStructural Templates (FASST) method uses all-against-all substructure comparison to identify high-level trends among functionally homologous proteins. For example, in the figure below, the catalytic substructures of 83 heme-dependent peroxidases are being compared simultaneously to see if structural variation exists, and if so, what biological explanation exists for the structural variation. In the substructure alignment below (shown in (a)), the color of each 5-residue substructure corresponds to its cluster assignment by FASST (shown in (b)). In the example shown below, the clusters identified are strongly correlated with the phylogenetic sub-groups of heme-dependent peroxidases. Furthermore, the structural clusters automatically identified by FASST can then be used to construct motif ensembles where a single active site model is actually represented by multiple structure examples. We have demonstrated that using a motif ensemble rather than a single-structure motif can vastly improve functional annotation sensitivity in many cases. For complete details and further experiments, please see the full paper.
The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method provides a semi-supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. We have applied CCORPS to the problem of identifying structural features of the kinase ATP binding site that are informative of inhibitor binding. The protein kinases are a large family of enzymes that play fundamental roles in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding affinity profile of 8 of the 38 kinase inhibitors studied, while only having overall poor predictive ability for 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown to not be rare among kinases. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors
Highly predictive clusters. (a) Structure of lck (PDB:2pl0) with a 3-position substructure shown in blue stick representation (Thr-316, Tyr-318, Gly-322) and bound imatinib molecule in red. (b) Substructure embedding computed by CCORPS when comparing the 3-positions shown in A across the entire 1958 structure dataset. Each point in the clustering represents a single 3-residue substructure. The coloring indicates the cluster membership of each substructure (21 clusters in total are shown). (c) Aligned 3-residue substructure representatives, from each of the 21 clusters identified by CCORPS, for the 3-position subset shown in A. The color of each substructure corresponds to its cluster assignment. (d) Same embedding as in B, but now colored according to affinity. The red and black coloring of each point indicates true and false affinity labels for flavopiridol, respectively, while white indicates substructures lacking affinity annotations.
We have also looked at computational methods to characterize and systematize the definition of good motifs. Once such technique is Geometric Sieving. The goal of Geometric Sieving is to propose geometric criteria to optimize known motifs and make them more effective motifs. An effective motif retains geometric and chemical similarity to substructures in functional homologs, while maintaining geometric and chemical dissimilarity to with all substructures of functionally unrelated proteins.
Geometric Sieving operates by measuring “geometric uniqueness,” a property which reflects the median geometric similarity between a motif and a reference set of protein structures. Different motifs have different degrees of geometric uniqueness, but we found that geometrically unique motifs tend to be effective as well. Geometric Sieving measures geometric uniqueness for every subset of a given input motif, but the most direct method involves computing millions of matches. In recent papers we describe one method for identifying the most geometrically unique subset without exhaustive match computation.
Designing high-quality motifs that accurately represent the most significant and unique structural features of the protein being model is critical to the function prediction performance of motif matching algorithms (such as LabelHash/MASH). Amino acids likely to have functional importance have often been found to reside within large cavities and clefts on protein surfaces, but these clefts can grow quite large, making them difficult to model. Often times in these large clefts or binding site regions, a much smaller subset of the amino acids present are more directly involved with the functional activity of the cavity. Given a larger input motif, Geometric Sieving identifies a smaller subset of amino acids that still uniquely represent the function of the modeled protein by statistically comparing the “geometric uniqueness” of each amino subset.
For any given motif, we can match it against either the entire PDB or a representative subset, such as the non-redundant PDB (nrPDB) or curated subsets, such as those by CATH and SCOP. Matching against one of the background sets provides us with a “profile” for a given motif, such as that shown below.
A desirable motif is geometrically similar (low Least Root Mean Square Deviation) to functionally homologous proteins and dissimilar to functionally unrelated proteins. By identifying those motifs that maximize the separation of related and unrelated proteins, high-performance motifs can be identified. For each subset motif, a profile is calculated by matching against the given background set as shown below.
After computing these profiles (which are actually probability density functions) for each subset motif, we then select the subset motif that pushes the median of the profile farthest to the right (to higher LRMSD values). This selected subset motif is then returned as a geometrically unique subset of the original input motif. Examples of the subset motif profiles computed for two different input motifs are shown below. Each blue curve is a subset motif profile; the two black curves are the least and most unique subsets.