Kavraki Lab


Protein motion plays a pivotal role in many biological studies that consider protein flexibility, such as drug discovery, protein folding, and other motion-related areas. When flexibility information is needed, researchers typically resort to simulations (molecular dynamics, Monte Carlo, etc.) to sample protein conformations that are physically possible but not seen in experimentally resolved structures. All-atom simulations are computationally very expensive for any medium-to-large protein: even on clusters of computers, they sample only atomic vibrations. Since most biological processes of interest occur on microsecond-to-millisecond time scales, coarse-grained simulation models have been developed to sample more relevant, large-scale motions for protein folding and flexibility studies.

Linear dimensionality reduction techniques such as Principal Component Analysis (PCA) have been applied to simulation data to produce a low-dimensional scatterplot of the simulation points. The scatterplot can be used to study the main geometric Degrees of Freedom (DOFs) of the system, and its coordinates can be used to compute, for example, free-energy surfaces.
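As a rough illustration of the linear approach, the following sketch projects a trajectory onto its top principal components with NumPy. The trajectory here is synthetic (an assumption made only for this example), the function name `pca_project` is hypothetical, and the sketch assumes the conformations have already been superimposed on a common reference so that overall rotation and translation do not dominate the variance.

```python
import numpy as np

def pca_project(conformations, n_components=2):
    """Project an (n_frames x 3N) array onto its top principal components.

    Returns the projected coordinates and the fraction of the total
    variance explained by each kept component.
    """
    X = np.asarray(conformations, dtype=float)
    X_centered = X - X.mean(axis=0)  # subtract the mean conformation
    # SVD of the centered data: the rows of Vt are the principal directions,
    # already sorted by decreasing singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    variances = S**2 / (X.shape[0] - 1)
    explained = variances / variances.sum()
    coords = X_centered @ Vt[:n_components].T
    return coords, explained[:n_components]

# Synthetic stand-in for a trajectory: 500 frames of a 10-atom system (3N = 30).
rng = np.random.default_rng(0)
n_frames, n_atoms = 500, 10
# Two large collective motions plus small independent per-coordinate noise.
collective = rng.normal(size=(n_frames, 2)) * np.array([5.0, 2.0])
directions = rng.normal(size=(2, 3 * n_atoms))
noise = 0.1 * rng.normal(size=(n_frames, 3 * n_atoms))
trajectory = collective @ directions + noise

coords, explained = pca_project(trajectory, n_components=2)
print(coords.shape)      # one 2D point per frame
print(explained.sum())   # fraction of variance captured by the two components
```

Because the synthetic data were built from two linear collective motions, two components capture almost all of the variance. Curved motions would not compress this cleanly under a linear projection.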

Two main uses of dimensionality reduction are:

  • Produce a low-dimensional representation of the simulation points. Arranging simulation points in a low-dimensional Euclidean space can provide a global “map” (or guide) to the process being studied; in the case of protein folding, free-energy surfaces can be computed as a function of these low-dimensional coordinates, which then serve as reaction coordinates.
  • Extract a few collective DOFs from the simulation data. Simulations of a protein with N atoms, each moving in 3D space, typically provide 3N numbers for each conformation. Since the protein’s atoms are constrained by physical forces, they cannot move independently of each other, so the number of effective DOFs is far smaller than 3N. These effective DOFs are typically referred to as collective DOFs because they summarize the collective motion of atoms working together to maintain the structural integrity of the protein.
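To make the first use concrete, here is a minimal sketch of how a free-energy surface can be estimated from low-dimensional coordinates by Boltzmann inversion, F = -kT ln P, on a histogram grid. It assumes the points were sampled at equilibrium at a known temperature; the function name, the bin width, and the toy data are hypothetical, and the units (kcal/mol, kelvin) are an assumption, since coarse-grained models often use reduced units instead.

```python
import math
from collections import Counter

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); assumed unit system

def free_energy_surface(points, bin_width, temperature=300.0):
    """Estimate F = -kT ln P on a 2D grid of low-dimensional coordinates.

    points: iterable of (x, y) pairs, assumed sampled at equilibrium.
    Returns {(ix, iy): F}, shifted so the most populated bin has F = 0.
    Bins that were never visited are simply absent from the result.
    """
    counts = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width)) for x, y in points
    )
    total = sum(counts.values())
    kT = K_B * temperature
    free_energy = {b: -kT * math.log(c / total) for b, c in counts.items()}
    f_min = min(free_energy.values())
    return {b: f - f_min for b, f in free_energy.items()}

# Toy data: a heavily populated basin and a sparsely populated one.
points = [(0.1, 0.1)] * 90 + [(2.5, 0.3)] * 10
surface = free_energy_surface(points, bin_width=1.0)
for grid_cell, f in sorted(surface.items()):
    print(grid_cell, round(f, 3))
```

The choice of bin width trades resolution against statistical noise, so it should be checked against the number of samples available in the sparsest regions of interest.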


We have developed a nonlinear dimensionality reduction technique called ScIMAP, inspired by the recently proposed Isomap algorithm, to analyze protein simulation data sets on the order of millions of samples. Because it is nonlinear, ScIMAP can capture curved collective motions of atoms, which makes it especially suited to studying protein folding and other significant protein motion problems.

ScIMAP considers the shape-wise connectivity of the conformational space to extract the main, curved collective atom motions and to arrange the simulation data along these new coordinates. Coordinates are ordered by variance, ranking the extracted DOFs in order of motion importance. Points that are closer together in the ScIMAP-extracted coordinate space represent conformations that are easily accessible from one another, according to the physical simulation.
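The ideas in the two paragraphs above can be illustrated with a sketch of the basic Isomap procedure that inspired ScIMAP: build a nearest-neighbor graph, approximate distances along the data manifold by shortest paths in that graph, and embed those distances with classical multidimensional scaling so that the output coordinates come out ordered by variance. This is plain Isomap, not ScIMAP itself. The function name and parameters are hypothetical, the input is a toy curve rather than simulation data, and the cubic-time shortest-path step used here would not scale to millions of samples.

```python
import numpy as np

def isomap(X, n_neighbors=8, n_components=2):
    """Basic Isomap: kNN graph -> graph shortest paths -> classical MDS."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff**2).sum(axis=-1))
    # Symmetric k-nearest-neighbor graph; missing edges are infinite.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]  # column 0 is self
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = nearest.ravel()
    G[rows, cols] = D[rows, cols]
    G[cols, rows] = D[rows, cols]
    # Floyd-Warshall: shortest paths approximate distances along the manifold.
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")
    # Classical MDS: double-center the squared distances and eigendecompose.
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (G**2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Toy curved "motion": 100 points along three quarters of a circle.
t = np.linspace(0.0, 1.5 * np.pi, 100)
arc = np.column_stack([np.cos(t), np.sin(t)])

embedding = isomap(arc, n_neighbors=4, n_components=2)
# The first coordinate should track the position along the arc.
print(abs(np.corrcoef(embedding[:, 0], t)[0, 1]))
```

On this toy curve the first embedded coordinate follows the arc parameter, which is the sense in which a nonlinear method can recover a single curved degree of freedom that a linear projection would spread over two axes.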

Application: Reaction Coordinates for Protein Folding through ScIMAP

Here is an example of the application of ScIMAP to produce a two-dimensional map of protein conformations gathered through folding/unfolding molecular dynamics simulations (using a coarse-grained model). The protein shown here is SH3 (src homology 3), which has 57 residues and is known to be a two-state folder.

Two-dimensional representation of a folding/unfolding simulation of SH3. Notice how ScIMAP captures the conformational freedom difference between the folded (towards the left) and unfolded (right) states, as well as the main folding route.

Free-energy landscape of the SH3 folding/unfolding reaction, computed using the ScIMAP coordinates. Red-to-blue indicates high-to-low free-energy values. Note how clearly the folded and unfolded basins are captured, together with the transition state along the main folding route.

ScIMAP captures the nonlinear nature of the folding process and arranges the simulation conformations along axes that represent the main geometric changes that happen during folding. The first coordinate (horizontal) captures the main direction of the folding mechanism, and the second coordinate (vertical) adds the detail needed to show the conformational freedom difference between the folded and unfolded states. The second coordinate also seems to indicate the presence of an alternate, lower-probability folding route.

ScIMAP can compute as many “dimensions” or “coordinates” as there are in the input data, ordered by importance. An example of a three-dimensional free-energy landscape for the SH3 data set used above is shown below.

Three-dimensional free-energy surface, represented as a collection of isosurfaces with similar free energy values. Representative conformations from the simulations have been superimposed on the plot to show how ScIMAP arranges the protein shapes into the low-dimensional map.