Protein Structure-Function Relationship: A Kernel-PCA Approach for Reaction Coordinate Identification

Parisa Mollaei, Amir Barati Farimani

TL;DR

Proteins encode function through conformation, but extracting structure-function links from high-dimensional MD trajectories is challenging. The authors introduce a Kernel-PCA pipeline with an angular kernel $K(\lambda_1, \lambda_2, \lambda_3)$ that maps atomic coordinates into a feature space, followed by PCA to a 2D representation and selection by the correlation ratio $C_r$ to identify and rank reaction coordinates (RCs) via their relation to a protein property. The method recovers known activation coordinates in the β2 adrenergic receptor and reveals RCs driving folding in small proteins, with network-like interactions among top RCs; using CB atoms often preserves information while drastically reducing feature size. This framework offers a generalizable, efficient tool for MD-based structure-function analysis and RC interpretation, with potential implications for drug design and protein engineering.

Abstract

In this study, we propose a Kernel-PCA model designed to capture structure-function relationships in a protein. This model also enables ranking of reaction coordinates according to their impact on protein properties. By leveraging machine learning techniques, including kernel methods and principal component analysis (PCA), our model uncovers meaningful patterns in high-dimensional protein data obtained from molecular dynamics (MD) simulations. The effectiveness of our model in accurately identifying reaction coordinates has been demonstrated through its application to a G protein-coupled receptor. Furthermore, this model utilizes a network-based approach to uncover correlations in the dynamic behavior of residues associated with a specific protein property. These findings underscore the potential of our model as a powerful tool for protein structure-function analysis and visualization.

Paper Structure

This paper contains 12 sections, 2 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of the Kernel-PCA model. Trajectories are first prepared to provide atomic coordinates as input for a kernel model $K(\lambda_1, \lambda_2, \lambda_3)$. PCA then reduces the kernel outputs to 2D representations. Finally, a correlation ratio ($C_r$) selects the optimal representation. This representation identifies and ranks reaction coordinates by their significance to the protein property, and also uncovers correlations among them.
  • Figure 2: Kernel-PCA representations of the NTL9 protein (a, b) and the $\beta_2$ adrenergic receptor (c, d) corresponding to the lowest (a, c) and highest (b, d) $C_r$ values. For the NTL9 protein: (a) K(0, 1, 0) with $C_r$ = 0.05, (b) K(0.5, 0.5, 0) with $C_r$ = 0.15. For the $\beta_2$ adrenergic receptor: (c) K(0, 1, 0) with $C_r$ = 0.03, (d) K(0.75, 0, 0.25) with $C_r$ = 0.13.
  • Figure 3: Identification of reaction coordinates in $\beta_2$AR using the Kernel-PCA model. (a) The known reaction coordinates in $\beta_2$AR are the intracellular parts of TM6 during the transition from the inactive to the active state. (b) H3-H6 distances mapped onto the representation. (c) Amino acid types located where essential conformational changes occur during the activation process. (d) Top ten PC1 reaction coordinates, indicating the dynamics of the residues that contribute most to the activation states of the receptor. (e) Dynamic motion of A271$^{6.33}$ projected onto the PC representation.
  • Figure 4: RMSDs projected onto the optimal representation, illustrating the maximum correlation ratio achieved using all atoms for (a) Protein B, (c) NTL9, (e) Trp-Cage, and (g) Chignolin. Panels (b), (d), (f), and (h) show the corresponding representations generated using only the CB atoms.
  • Figure 5: Strong correlation between the top two residues associated with the protein property. (a) Relationship between the A271$^{6.33}$ and L272$^{6.34}$ residues in the inactive (blue), intermediate (green), and active (red) states of $\beta_2$AR. (b) Interaction between the L6 and K7 residues in the unfolded (blue), intermediate (green), and folded (red) states of the NTL9 protein.
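
The selection step described in the Figure 1 caption (choose the kernel weights whose representation has the highest $C_r$) can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: the weights $(\lambda_1, \lambda_2, \lambda_3)$ are taken to lie on a grid of step 0.25 and sum to 1 (inferred only from the triples quoted in Figure 2), $C_r$ is stood in for by the classical correlation ratio $\eta^2$ (between-state variance over total variance), and the protein property is treated as a discrete state label. The function names `lambda_grid`, `correlation_ratio`, and `select_best` are hypothetical, and the projection values are made-up toy numbers rather than results from the paper.

```python
from itertools import product
from statistics import fmean


def lambda_grid(step=0.25):
    """Enumerate weight triples (l1, l2, l3) with l1 + l2 + l3 == 1.

    Assumption: the grid step and the sum-to-one constraint are inferred
    from the triples quoted in Figure 2, not stated in the source.
    """
    n = round(1 / step)
    return [
        (i * step, j * step, (n - i - j) * step)
        for i, j in product(range(n + 1), repeat=2)
        if i + j <= n
    ]


def correlation_ratio(values, labels):
    """Classical eta^2: between-group variance divided by total variance.

    Assumption: this stands in for the paper's C_r, whose exact
    definition is not given in this summary.
    """
    grand_mean = fmean(values)
    total = sum((v - grand_mean) ** 2 for v in values)
    if total == 0:
        return 0.0
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum(
        len(members) * (fmean(members) - grand_mean) ** 2
        for members in groups.values()
    )
    return between / total


def select_best(projections, labels):
    """Return the weight triple whose projection best separates the states."""
    return max(
        projections,
        key=lambda lam: correlation_ratio(projections[lam], labels),
    )


# Toy data (hypothetical): one PC1 value per frame for two weight triples.
labels = ["inactive", "inactive", "active", "active"]
toy = {
    (0.0, 1.0, 0.0): [0.1, 0.9, 0.2, 0.8],    # states overlap along PC1
    (0.75, 0.0, 0.25): [0.1, 0.2, 0.8, 0.9],  # states separate along PC1
}
best = select_best(toy, labels)
```

In a full pipeline, each entry of `toy` would be replaced by the PCA projection of the kernel output for that weight triple, and a continuous property such as RMSD would need either binning into states or a different association measure.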