Efficient Approximation of Molecular Kinetics using Random Fourier Features

Feliks Nüske, Stefan Klus

TL;DR

By combining the RFF approach and model selection by means of the VAMP score, it is shown that kernel parameters can be efficiently tuned and accurate estimates of slow molecular kinetics can be obtained for several benchmarking systems, such as deca alanine and the NTL9 protein.

Abstract

Slow kinetic processes of molecular systems can be analyzed by computing dominant eigenpairs of the Koopman operator or its generator. In this context, the Variational Approach to Markov Processes (VAMP) provides a rigorous way of discerning the quality of different approximate models. Kernel methods have been shown to provide accurate and robust estimates for slow kinetic processes, but are sensitive to hyper-parameter selection, and require the solution of large-scale generalized eigenvalue problems, which can easily become computationally demanding for large data sizes. In this contribution, we employ a stochastic approximation of the kernel based on random Fourier features (RFFs), to derive a small-scale dual eigenvalue problem which can easily be solved. We provide an interpretation of this procedure in terms of a finite randomly generated basis set. By combining the RFF approach and model selection by means of the VAMP score, we show that kernel parameters can be efficiently tuned, and accurate estimates of slow molecular kinetics can be obtained for several benchmarking systems, such as deca alanine and the NTL9 protein.
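The stochastic kernel approximation mentioned above can be illustrated with a minimal sketch. It assumes a Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$ and uses the standard real-valued cosine/sine form of random Fourier features; the paper's own feature convention may differ (for instance, complex exponentials). The function names `gaussian_kernel` and `rff_features`, the toy data, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian kernel matrix with bandwidth sigma (illustrative helper)."""
    sq_dist = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def rff_features(X, omega):
    """Real-valued random Fourier features (illustrative helper).

    X has shape (m, d), omega has shape (p, d). The result has shape
    (m, 2p) and satisfies Phi @ Phi.T ~ K for the Gaussian kernel whose
    spectral measure the rows of omega were drawn from.
    """
    proj = X @ omega.T
    p = omega.shape[0]
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(p)


rng = np.random.default_rng(0)
d, m, p, sigma = 2, 200, 2000, 0.4  # toy values, chosen for illustration only

X = rng.normal(size=(m, d))  # synthetic data, not molecular simulation data
# The spectral measure of the Gaussian kernel is N(0, sigma^{-2} I).
omega = rng.normal(scale=1.0 / sigma, size=(p, d))

Phi = rff_features(X, omega)
K_exact = gaussian_kernel(X, X, sigma)
K_rff = Phi @ Phi.T
max_err = np.max(np.abs(K_exact - K_rff))
print(f"feature matrix shape: {Phi.shape}, max abs kernel error: {max_err:.3f}")
```

In this sketch the feature matrix has shape $(m, 2p)$, so matrices built from it, such as `Phi.T @ Phi`, have size $2p \times 2p$ regardless of the number of data points $m$. This is the sense in which a finite, randomly generated basis set leads to a small-scale problem, although the sketch does not implement the Koopman or generator eigenvalue problem itself.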

Paper Structure (25 sections, 53 equations, 5 figures, 1 algorithm)

Figures (5)

  • Figure 1: Results for the Lemon-Slice potential. A: Contour of the potential \ref{eq:lemonslice}. B: VAMP score as a function of the kernel bandwidth for different data sizes $m$ and feature numbers $p$. C: Leading non-trivial eigenvalues of the generator for $m = 5000$ and $p = 50$, as a function of the bandwidth. Black lines indicate the Markov model reference values. D: Decomposition into four metastable states based on eigenvectors for $\sigma = 0.4$, $p = 50$, $m = 5000$. All error bars are based on twenty independent simulations.
  • Figure 2: Results for alanine dipeptide. A: Free energy (in $\mathrm{kJ/mol}$) in two-dimensional dihedral space. B: VAMP score for selected feature sizes $p$ and lag times $t$ as a function of the kernel bandwidth. C: Implied timescales for $t = 100\,\mathrm{ps}$ and $p = 50$ as a function of the bandwidth. D: Metastable decomposition obtained for $\sigma = 0.6$, $p = 50$ and $t = 100\,\mathrm{ps}$.
  • Figure 3: Results for the Koopman generator on backbone dihedral angle space of the deca alanine peptide. A: VAMP score for selected data sizes $m$ and feature sizes $p$ as a function of the kernel bandwidth. B: Eigenvalues for $m = 1000$ and $p = 300$ as a function of the bandwidth. The reference MSM results are shown in black. Re-scaling MSM eigenvalues by the average ratio between optimal RFF and MSM eigenvalues leads to the magenta lines. C--F: Representative structures for each of the four PCCA states based on the RFF model at $m = 1000$, $p = 300$, $\sigma = 4.0$.
  • Figure 4: Results for NTL9 protein. A: VAMP score for Gaussian RFF approximation as a function of the kernel bandwidth $\sigma$. Red and blue lines show the results using all 666 distances, for different lag times and different numbers of Fourier features $p$. The magenta lines show the average values over ten random selections of only $50$ distances, with $p = 300$ fixed. B: Slowest implied timescale $t_1$ as a function of the lag time $t$. Green lines show estimates based on linear TICA using the top-ranked 20, 100, and 300 distances. Blue lines show RFF-based estimates on all distances using different values of $p$, using $\sigma = 15.0$. Red lines are averages over the above-mentioned random subsamples of the distance coordinates, where $p = 300$ is fixed.
  • Figure 5: Top Row: Representations of the two PCCA states obtained for $\sigma = 15$, $p = 300$, $t = 2\,\mu s$. For each PCCA state, we show the fraction of simulation time during which each residue-residue pair forms a contact, see also Nueske2021. Bottom Row: Representative protein structure for both PCCA states.

Theorems & Definitions (1)

  • Example 2.1