adabmDCA 2.0 -- a flexible but easy-to-use package for Direct Coupling Analysis

Lorenzo Rosset, Roberto Netti, Anna Paola Muntoni, Martin Weigt, Francesco Zamponi

TL;DR

adabmDCA 2.0 delivers a flexible, energy-based Direct Coupling Analysis framework implemented in C++, Julia, and Python with a unified CLI. It combines dense bmDCA learning with two sparse variants, eaDCA and edDCA, and supports downstream tasks such as residue contact prediction, mutational-effect scoring, and sequence generation for proteins and RNAs. The method relies on Boltzmann learning with Monte Carlo gradient estimates and persistent contrastive divergence to fit the one- and two-site statistics of an MSA, using a pseudocount and sequence weighting to correct for biases. The package emphasizes practical usability, convergence diagnostics, and modularity across hardware, providing tools for structure prediction and sequence design in biomolecular research.

Abstract

In this methods article, we provide a flexible but easy-to-use implementation of Direct Coupling Analysis (DCA) based on Boltzmann machine learning, together with a tutorial on how to use it. The package adabmDCA 2.0 is available in different programming languages (C++, Julia, Python) and runs on different architectures (single-core and multi-core CPU, GPU) through a common front-end interface. In addition to several learning protocols for dense and sparse generative DCA models, it directly addresses common downstream tasks such as residue-residue contact prediction, mutational-effect prediction, scoring of sequence libraries, and generation of artificial sequences for sequence design. It is readily applicable to protein and RNA sequence data.

Paper Structure

This paper contains 37 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Schematic representation of a DCA model.
  • Figure 2: Schematic representation of the sparse model training. A) edDCA: the sparsification is obtained by progressively pruning contacts from an initial fully connected model. B) eaDCA: the couplings are progressively added during the training.
  • Figure 3: Example of RNA sequences formatted in FASTA format.
  • Figure 4: One minus the Pearson correlation coefficient between the data and chains correlation matrices as a function of the training time for the protein family PF00072. The curve is well approximated by a power law decay.
  • Figure 6: Analysis of a bmDCA model. Left: measurement of the mixing time of the model using $10^4$ chains. The curves show the average overlap among randomly initialized samples (dark blue) and the average overlap of the same chains between times $t$ and $t/2$ (light blue). Shaded areas represent the error of the mean. When the two curves merge, we can assume that the chains at time $t$ have lost memory of their configurations at time $t/2$. This point gives an estimate of the mixing time of the model, $t^{\mathrm{mix}}$. Note that the times start from 1, so the initial conditions are not shown. Right: Scatter plot of the entries of the covariance matrix of the data versus those of the generated samples.
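
The convergence measure named in the Figure 4 caption (one minus the Pearson correlation between the data and chain correlation matrices) can be illustrated with a minimal NumPy sketch. This is not the package's implementation: the function names, the integer encoding of the alignment, the use of connected two-site correlations, and the restriction to pairs of distinct sites are all assumptions made for illustration.

```python
import numpy as np

def one_hot(msa, q):
    # msa: (M, L) integer array with entries in [0, q); returns (M, L*q),
    # ordered site by site (all q states of site 0, then site 1, ...).
    return np.eye(q)[msa].reshape(msa.shape[0], -1)

def connected_correlations(msa, q, weights=None):
    # Hypothetical helper: C_ij(a, b) = f_ij(a, b) - f_i(a) f_j(b),
    # optionally using per-sequence weights (normalized to sum to 1).
    x = one_hot(msa, q)
    m = x.shape[0]
    if weights is None:
        w = np.full(m, 1.0 / m)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    fi = w @ x                      # one-site frequencies
    fij = (x * w[:, None]).T @ x    # two-site frequencies
    return fij - np.outer(fi, fi)

def one_minus_pearson(data_msa, chains_msa, q, data_weights=None):
    # Compare data and sampled-chain correlations on pairs of sites i < j.
    n_sites = data_msa.shape[1]
    c_data = connected_correlations(data_msa, q, data_weights)
    c_chains = connected_correlations(chains_msa, q)
    site = np.repeat(np.arange(n_sites), q)   # site index of each column
    mask = site[:, None] < site[None, :]
    r = np.corrcoef(c_data[mask], c_chains[mask])[0, 1]
    return 1.0 - r

# Toy usage with random data (not a real alignment).
rng = np.random.default_rng(0)
data = rng.integers(0, 4, size=(200, 6))
chains = rng.integers(0, 4, size=(200, 6))
score = one_minus_pearson(data, chains, q=4)
```

A value close to 0 indicates that the sampled chains reproduce the two-site statistics of the data; tracking it over training time gives a curve of the kind described in the caption.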