Automatic feature selection and weighting in molecular systems using Differentiable Information Imbalance

Romina Wild, Felix Wodaczek, Vittorio Del Tatto, Bingqing Cheng, Alessandro Laio

TL;DR

The authors present Differentiable Information Imbalance (DII), a method to optimize feature weights, align units, and find the best feature set size, available in the Python library DADApy.

Abstract

Feature selection is essential in the analysis of molecular systems and many other fields, but several uncertainties remain: What is the optimal number of features for a simplified, interpretable model that retains essential information? How should features with different units be aligned, and how should their relative importance be weighted? Here, we introduce the Differentiable Information Imbalance (DII), an automated method to rank information content between sets of features. Using distances in a ground truth feature space, DII identifies a low-dimensional subset of features that best preserves these relationships. Each feature is scaled by a weight, which is optimized by minimizing the DII through gradient descent. This allows simultaneously performing unit alignment and relative importance scaling, while preserving interpretability. DII can also produce sparse solutions and determine the optimal size of the reduced feature space. We demonstrate the usefulness of this approach on two benchmark molecular problems: (1) identifying collective variables that describe conformations of a biomolecule, and (2) selecting features for training a machine-learning force field. These results show the potential of DII in addressing feature selection challenges and optimizing dimensionality in various applications. The method is available in the Python library DADApy.
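The weighting idea in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the DII formula, so the softmax-over-distances form used here, the smoothing parameter `lam`, and all function names are assumptions made for illustration. This is not the DADApy API, and central finite differences stand in for an analytic gradient.

```python
import numpy as np

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X, shape (N, N)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def distance_ranks(X):
    """ranks[i, j] = rank of point j among the neighbours of i (nearest = 1)."""
    n = X.shape[0]
    d2 = pairwise_sq_dists(X)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbour
    order = np.argsort(d2, axis=1)
    ranks = np.empty((n, n), dtype=float)
    ranks[np.arange(n)[:, None], order] = np.arange(1, n + 1)
    return ranks

def dii(weights, X_a, ranks_b, lam):
    """Assumed DII form: (2 / N^2) * sum_ij c_ij * r_ij^B, where c_ij is a
    softmax over the squared distances of the weighted features."""
    n = X_a.shape[0]
    d2 = pairwise_sq_dists(X_a * weights)
    np.fill_diagonal(d2, np.inf)  # makes c_ii exactly zero
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / lam  # stabilised softmax
    c = np.exp(logits)
    c /= c.sum(axis=1, keepdims=True)
    return 2.0 / n**2 * np.sum(c * ranks_b)

def numerical_gradient(f, w, eps=1e-4):
    """Central finite differences (a stand-in for an analytic gradient)."""
    grad = np.zeros_like(w)
    for k in range(w.size):
        step = np.zeros_like(w)
        step[k] = eps
        grad[k] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

def optimize_weights(X_a, X_b, lam=0.05, lr=0.5, n_steps=50):
    """Plain gradient descent on the weights; returns the best weights seen."""
    ranks_b = distance_ranks(X_b)
    loss = lambda w: dii(w, X_a, ranks_b, lam)
    w = np.ones(X_a.shape[1])
    best_w, best_loss = w.copy(), loss(w)
    for _ in range(n_steps):
        w = w - lr * numerical_gradient(loss, w)
        current = loss(w)
        if current < best_loss:
            best_w, best_loss = w.copy(), current
    return best_w, best_loss

# Toy data (hypothetical): only feature 0 defines the ground-truth distances.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X_truth = X[:, :1]
ranks = distance_ranks(X_truth)
dii_informative = dii(np.array([1.0, 0.0, 0.0]), X, ranks, lam=0.05)
dii_noise = dii(np.array([0.0, 1.0, 0.0]), X, ranks, lam=0.05)
best_w, best_dii = optimize_weights(X, X_truth)
```

In this toy setting, keeping only the informative feature should give a clearly lower DII than keeping only a noise feature, and `optimize_weights` returns the lowest-DII weights it encountered. A sparsity-inducing penalty such as the L$_1$ term mentioned in the abstract could be added to `loss`, but it is left out to keep the sketch short.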

Paper Structure

This paper contains 25 sections, 11 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: $\boldsymbol{DII}$ feature selection applied to Gaussian random variables and their monomials. A: The input features are ten independent and identically distributed Gaussian random variables, $X^1$-$X^{10}$. The same features are used as ground truth, but scaled. I: Differentiable Information Imbalance ($DII$), with (orange) and without (blue) L$_1$ regularization in the optimization. The insets show two exemplary features, with the weights during optimization (orange) and the ground truth weights (gray). II: Cosine similarity (overlap) of the ground truth and optimized weights in gray, and $DII$s in black with colored markers, for several L$_1$ strengths and associated numbers of non-zero features. Table 1 provides the ground truth and optimized weights for points in this graph. B: The feature space consists of the 285 monomials up to order three of the ten Gaussian random variables from A. As ground truth, ten features were selected at random and scaled, while all the other feature weights are zero. I and II: Analogous to A. Table 2 provides the ground truth and optimized weights for points in this graph. Source data are provided as a Source Data file.
  • Figure 2: $\boldsymbol{DII}$ feature selection for describing the free energy landscape and conformations of CLN025. A: Green: Optimal Differentiable Information Imbalance ($DII$) results for collective variable (CV) subsets of different sizes with gradient descent optimized weights for 1429 data points evenly sampled from the full trajectory. The green star marks the $DII$ result of the optimally scaled 3-plet, which defines the coordinate system for B. Inset: $DII$ gradient descent optimization for the optimal 5-plet. Blue and orange: Average and standard deviations of the $DII$ calculated from block cross validation with 4 non-overlapping training data sets and 84 validation sets of 1428 points each. B: Free energy isosurfaces in the space of the optimal 3-plet of CVs (radius of gyration (RGYR), principal components 1 and 2 (PC1 and PC2), with weights of 1.0, 3.5 and 4.7), corresponding to three different values of the free energy. The renderings around the free energy surfaces show sampled conformations of the peptide at different values of the CVs and free energy. C: Red and blue renderings are cluster centers obtained from the optimal 3-plet space and from the full space of all pairwise heavy atom distances, respectively. The two main cluster centers of both belong to the dominant peptide conformations: The $\beta$-pin and the collapsed denatured state. The collapsed and $\beta$-pin clusters identified in the optimal 3-plet space share 92% and 87% of the frames with the corresponding full space clusters. Source data are provided as a Source Data file.
  • Figure 3: $\boldsymbol{DII}$ feature selection for efficient training of a Machine Learning Potential (MLP). A: Differentiable Information Imbalance ($DII$) selecting the optimal feature subsets from $D_A=176$ Atom Centered Symmetry Functions (ACSF) descriptors, against a ground truth of $D_B=546$ Smooth Overlap of Atomic Positions (SOAP) descriptors, using a data set of $N\sim 350$ atomic environments. The optimized $DII$ per number of non-zero features is shown by blue circles and orange diamonds, using L$_1$ regularized search and greedy backward selection, respectively. The filled area represents validation data in the form of the minimum and maximum $DII$ on 10 batches of $\sim$350 atomic environments other than the $\sim$350 environments used for $DII$ feature selection. The $DII$ for randomly selecting a certain number of non-zero features is depicted as gray bars between the lowest and highest $DII$ found within 10 random selections. B: Test root-mean-square error (RMSE) of Behler-Parrinello-type MLPs [Behler2007], as implemented in n2p2 [Singraber2019, Singraber2019_2], trained on features chosen via L$_1$ regularized $DII$ (blue circles) and at random (gray triangles). Six MLPs with different train-test splits are trained per number of non-zero features. Markers represent their average RMSE, and the filled area shows the range from worst to best performer. C: Run-time of force and energy prediction on a single structure performed by the same MLPs as in B. The filled area shows the range from worst to best performer, although it is barely visible because run-times are similar across the six MLPs. Source data are provided as a Source Data file.
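
Two quantities in the Figure 1 caption are easy to reproduce: the 285 monomials up to order three of ten variables, and the cosine similarity used as the overlap between ground truth and optimized weights. The sketch below is an illustration only. The function names and the toy data are hypothetical, and the monomial ordering is an arbitrary choice rather than the one used by the authors.

```python
from itertools import combinations_with_replacement

import numpy as np

def monomial_features(X, max_order=3):
    """All monomials of the columns of X with total degree 1..max_order."""
    n_vars = X.shape[1]
    columns = []
    for order in range(1, max_order + 1):
        # Each multiset of column indices defines one monomial, e.g. (0, 0, 2)
        # corresponds to x0 * x0 * x2.
        for idx in combinations_with_replacement(range(n_vars), order):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

def cosine_similarity(w_true, w_opt):
    """Overlap of two weight vectors; unaffected by a global positive rescaling."""
    norm = np.linalg.norm(w_true) * np.linalg.norm(w_opt)
    return float(np.dot(w_true, w_opt) / norm)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))      # ten i.i.d. Gaussian variables (toy sample)
F = monomial_features(X)           # 10 + 55 + 220 = 285 columns

w = rng.uniform(size=10)
overlap = cosine_similarity(w, 3.0 * w)   # rescaled copy of the same weights
```

Because the cosine similarity ignores a global positive scale factor, two weight vectors that differ only by such a factor have an overlap of 1.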