Applied causality to infer protein dynamics and kinetics

Akashnathan Aranganathan, Eric R. Beyerle

TL;DR

This work addresses the lack of timescale information in conformational ensembles produced by generative molecular models. It couples AlphaFold2-derived ensembles to a causal, memoryless, overdamped Langevin framework (LE4PD) by parameterizing a potential of mean force from the ensemble covariance, which yields residue-level kinetics through autocorrelation functions and the integrated timescale $\tau_{\mathrm{avg}}$. Across six HIV-1 protease variants, the study shows an inverse relationship between AlphaFold2 MSA depth and ensemble timescales: shallow MSAs produce ensembles that span longer timescales and deeper MSAs produce shorter ones, generally at or below those seen in microsecond-long MD simulations. Reweighting out-of-set structures further refines the predictions. The approach generalizes to other generative ML models and to AFmultimer for dimer dynamics, offering a route toward fast, causally meaningful predictions of protein kinetics, with potential integration into ML training and reweighting schemes for improved physical fidelity.

Abstract

The use of generative machine learning models, trained on the experimentally resolved structures deposited in the Protein Data Bank, is an attractive approach to sampling conformational ensembles of proteins. However, the ensembles generated by these models lack timescale or causal information. We use the structural ensembles generated from AlphaFold2 at a range of MSA depths to parameterize the potential of mean force of an overdamped, memory-free, coarse-grained Langevin equation. This approach couples the AlphaFold2 ensembles to a causal model, allowing us to estimate the timescales spanned by the ensembles generated at each MSA depth. Performing this analysis on six variants of HIV-1 protease, we confirm an inverse relationship between MSA depth and the timescale of an ensemble's conformational fluctuations. The MSA depth essentially serves as a conformational restraint, and AlphaFold2 is generally able to probe timescales at or below those seen in microsecond-long, unbiased molecular dynamics simulations. We conclude by generalizing this approach to other generative structural ensemble-prediction methods as well as co-folding models, in this case the biologically functional HIV-1 protease dimer.
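To make the coupling between a static ensemble and a causal model concrete, here is a minimal sketch of the idea in Python. It is not the authors' LE4PD implementation. It assumes a pre-aligned C-alpha ensemble, an isotropic Gaussian potential of mean force built from the ensemble covariance, and a single scalar friction coefficient with no hydrodynamic coupling. Under those assumptions each covariance eigenmode relaxes as an Ornstein-Uhlenbeck process, and a residue's integrated timescale is the variance-weighted average of the mode relaxation times. The function names, the `friction` and `kT` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def residue_timescales(coords, friction=1.0, kT=1.0, tol=1e-10):
    """Mode and residue relaxation times from a pre-aligned ensemble.

    coords: array (n_frames, n_res, 3) of C-alpha positions.
    friction, kT: assumed scalar friction and thermal energy (arbitrary units).
    """
    dev = coords - coords.mean(axis=0)           # fluctuations about the mean structure
    # Isotropic residue-residue covariance, C_ij = <dr_i . dr_j> / 3
    cov = np.einsum("fid,fjd->ij", dev, dev) / (3.0 * coords.shape[0])
    lam, vec = np.linalg.eigh(cov)               # mode variances and mode shapes
    keep = lam > tol * lam.max()                 # drop numerically null modes
    lam, vec = lam[keep], vec[:, keep]
    # Harmonic PMF U = (kT/2) x^T C^{-1} x gives mode rate kT / (friction * lam)
    tau_modes = friction * lam / kT
    # Fraction of residue i's variance carried by mode a
    weights = vec**2 * lam
    weights /= weights.sum(axis=1, keepdims=True)
    tau_avg = weights @ tau_modes                # time integral of each residue's ACF
    return tau_modes, weights, tau_avg

def residue_acf(weights, tau_modes, times):
    """Normalized autocorrelation function of each residue, shape (n_res, n_times)."""
    return weights @ np.exp(-np.outer(1.0 / tau_modes, times))

# Synthetic stand-in for a generated ensemble: 200 frames, 10 residues,
# with fluctuation amplitude growing along the chain (illustrative only).
rng = np.random.default_rng(0)
amplitude = np.linspace(0.5, 2.0, 10)[None, :, None]
ensemble = rng.normal(size=(200, 10, 3)) * amplitude

tau_modes, weights, tau_avg = residue_timescales(ensemble)
acf = residue_acf(weights, tau_modes, np.linspace(0.0, 5.0 * tau_modes.max(), 100))
```

In this simplified model the relaxation times scale with the ensemble variance, so a more tightly restrained ensemble gives shorter timescales. That is the qualitative behavior the abstract attributes to increasing MSA depth, though the real method has more ingredients than this sketch.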

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures.

Figures (8)

  • Figure 1: Overview of our method. a) The protein sequence is input to the machine learning architecture of choice, generating an ensemble of all-atom configurations. This ensemble is coarse-grained at the C$_{\alpha}$ level and input to the Langevin Equation for Protein Dynamics (LE4PD), which predicts the relaxation timescales (kinetics) of each residue's structural fluctuations (dynamics). b) A representative structure of the HIV-1 protease monomer is shown, with the flap (residues 49-51) and N- and C-termini labeled. c) Residue-dependent autocorrelation functions as a function of MSA depth input to AF2 for the three labeled structural regions in (b). d) Predicted relaxation time of each residue as a function of residue index for one HIV-1 protease sequence, PDB ID: 1EBW.
  • Figure 2: Estimated timescales for the relaxation kinetics of each residue for the six indicated sequences of HIV-1 protease at MSA depths of a) 512, b) 64, and c) 8. Each sequence variant label on the right-hand side of the plot is printed with the same color as its integrated timescale curve.
  • Figure 3: Measuring the volume of the space of the slow dynamics spanned by the AlphaFold2 conformational ensembles. a) AlphaFold2-generated ensemble projected onto the free-energy surface spanned by the two slowest LE4PD modes parameterized by a one-microsecond MD simulation of the HIV-1 protease sequence encoded by PDB ID: 1EBW. b) Variance of the AlphaFold2-generated ensembles in the space spanned by the two slowest LE4PD modes, relative to the one-microsecond MD simulation. The bottom plot shows what fraction of structures generated by AlphaFold2 at the indicated MSA depth lie outside the support of the one-microsecond dynamics.
  • Figure 4: Reporting the timescales measured by AlphaFold2 conformational ensembles generated using different MSA depths. a) Percent of each residue's ACF best measured by each of the reported MSAs at three different timescales: 10, 100, and 1000 ns. Colored circles denote the average across sequences at each MSA, while grey circles denote the result for each sequence individually. b) Scatter plot and linear fits (colored, dashed lines) of the average integrated correlation time from the reference MD simulations ($\tau_{\text{avg}}$ MD sim.) and the LE4PD theory parameterized using the AlphaFold2 conformational ensembles at the given MSA depth ($\tau_{\text{avg}}$ rMSA AF2). Reported are the correlation times from MD and AlphaFold2 for the HIV-1 protease sequence encoded in PDB ID: 1EBW at the 1000 ns timescale. c) Pearson correlation coefficient between the timescales reported in b), but for all six HIV-1 protease sequences studied here and for all three timescales (10, 100, and 1000 ns). Also reported in the second column are the slopes of the linear regression of $\tau_{\text{avg}}$ MD sim. onto $\tau_{\text{avg}}$ rMSA AF2 for all six sequences (1Q9P, 2PC0, 3TTP, 1EBW, 4Z4X, 6P9A) and all three timescales (10, 100, and 1000 ns).
  • Figure 5: Effect of out-of-set structure removal on the predicted dynamics. All comparisons to MD are performed at the 1000 ns timescale. a) Root-mean-square fluctuations of the alpha-carbons predicted from the one-microsecond MD simulation of HIV-1 protease with the sequence encoded by PDB ID: 1EBW (black) compared to the AlphaFold2 ensemble generated using an MSA depth of 8 and starting from the same sequence with (red, solid) and without (red, dashed) out-of-set structures in the ensemble input to the LE4PD theory. b) Correlation between the integrated correlation timescale predicted with the entire AlphaFold2 ensemble ($\tau_{\text{ML}}$) and the ensemble with out-of-set structures removed ($\tau_{\text{avg}}$ ML (filtered)) for all six sequences. The points are colored by the sequence given in the subplot's legend. c) Integrated timescales $\tau_{\text{avg}}$ for 1EBW from the LE4PD theory parameterized using the AF2 ensemble generated using an MSA depth of 8 with (blue) and without (red) out-of-set structures included in the ensemble. Including the data points from all six sequences, the Pearson correlation coefficient between the timescales calculated from the ensembles with and without out-of-set structures is 0.912.
  • ...and 3 more figures
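
The comparisons described in the captions of Figures 4 and 5 reduce to two standard statistics on paired per-residue timescales: a Pearson correlation coefficient and the slope of a linear regression of the MD timescales onto the ensemble-derived ones. The sketch below shows one way to compute both with NumPy. The function name and the example arrays are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def compare_timescales(tau_md, tau_ml):
    """Pearson r and linear fit of MD timescales (y) onto ML-ensemble timescales (x)."""
    tau_md = np.asarray(tau_md, dtype=float)
    tau_ml = np.asarray(tau_ml, dtype=float)
    r = np.corrcoef(tau_ml, tau_md)[0, 1]                 # Pearson correlation coefficient
    slope, intercept = np.polyfit(tau_ml, tau_md, deg=1)  # least-squares line y = slope*x + intercept
    return r, slope, intercept

# Placeholder per-residue timescales in arbitrary units (illustrative only).
tau_ml_example = np.array([0.8, 1.1, 2.5, 3.0, 1.7])
tau_md_example = np.array([1.0, 1.5, 3.2, 4.1, 2.0])
r, slope, intercept = compare_timescales(tau_md_example, tau_ml_example)
```

A slope near one together with a high correlation would indicate that the two sets of timescales agree in magnitude as well as in rank order. A high correlation with a slope far from one would indicate agreement in trend only.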