The impact of conformer quality on learned representations of molecular conformer ensembles

Keir Adams, Connor W. Coley

TL;DR

This work investigates how the geometric quality of input conformers affects 3D ML surrogates that predict properties of high-quality conformer ensembles. Using 14 3D GNN models (based on DimeNet++) to predict ensemble-level Sterimol descriptors for carboxylic acids, it systematically varies input conformer quality (DFT, xTB, MMFF94) and encoding strategies (single active, random, augmented, decoy-sets). The main finding is that, for ensemble-level Sterimol properties, surrogates trained on high-quality ensembles rarely outperform simply computing the descriptors from cheap ensembles, and set-based encodings offer limited gains. Performance deteriorates when the active conformer is absent or degraded, though data augmentation improves robustness. These results provide practical guidance: for high-throughput scenarios where the target depends on an active conformer, it can be more cost-effective to rely on cheap conformer ensembles than to train costly 3D surrogates, though 3D models may offer selective benefits for geometry-sensitive descriptors and robust learning in some contexts.

Abstract

Training machine learning models to predict properties of molecular conformer ensembles is an increasingly popular strategy to accelerate the conformational analysis of drug-like small molecules, reactive organic substrates, and homogeneous catalysts. For high-throughput analyses especially, trained surrogate models can help circumvent traditional approaches to conformational analysis that rely on expensive conformer searches and geometry optimizations. Here, we question how the performance of surrogate models for predicting 3D conformer-dependent properties (of a single, active conformer) is affected by the quality of the 3D conformers used as their input. How well do lower-quality conformers inform the prediction of properties of higher-quality conformers? Does the fidelity of geometry optimization matter when encoding random conformers? For models that encode sets of conformers, how does the presence of the active conformer that induces the target property affect model accuracy? How do predictions from a surrogate model compare to estimating the properties from cheap ensembles themselves? We explore these questions in the context of predicting Sterimol parameters of conformer ensembles optimized with density functional theory. Although answers will be case-specific, our analyses provide a valuable perspective on 3D representation learning models and raise practical considerations regarding when conformer quality matters.

Paper Structure

This paper contains 17 sections, 13 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (Top) Simulating properties of conformer ensembles typically involves a multi-step workflow beginning with the generation of high-quality conformer ensembles, which requires an initial conformer search, structural clustering, (high-level) geometry optimization, and energy filtering. Properties are then simulated for each conformer individually, followed by some kind of aggregation (e.g., Boltzmann-averaging, or using the maximum simulated property value of an "active" conformation). (Middle) Once trained, machine learning models can be used to shortcut expensive conformational analyses by directly predicting the ensemble-level properties from cheap-to-simulate molecular representations like a molecular graph, a single low-level conformation, or a set of low-level conformations. (Bottom) In this work, we consider how conformer quality impacts the performance of machine learning models that are trained to predict properties of high-quality conformer ensembles.
  • Figure 2: Performance of 3D machine learning surrogate models trained to predict Sterimol-B5 (min/max) or Sterimol-L (min/max) descriptors of DFT-optimized conformer ensembles from encodings of random MMFF94-optimized conformers, as a function of training set size. Performance is measured by the mean absolute error on the test set, averaged across the three test sets. We compare model accuracy against the accuracy of simply computing these descriptors from cheap-to-simulate conformer ensembles optimized with either MMFF94 or xTB.
  • Figure 3: Comparison of average prediction error across the test sets for different types of representation learning models that encode the true "active" conformation re-optimized with either DFT (black), xTB (red), or MMFF94 (teal); a single random conformer at various optimization levels; a single random conformer but training with data augmentation; or a set of random conformers at various optimization levels. Model performance is evaluated by the relative increase in error compared to the model that encodes the true DFT-level "active" conformation, which serves as an upper bound on achievable performance.
  • Figure 4: Comparison in performance between ML surrogate models that encode the true "active" conformation that has been re-optimized (i.e., corrupted) with DFT, xTB, or MMFF94; models that encode "decoy-sets" containing the active conformer and up to 9 other decoys at the same level of theory; and models that encode sets of up to 10 random xTB- or MMFF94-optimized conformations. Performance is evaluated by the relative increase in MAE on the test set compared to the model that encodes the true DFT-level active conformation, and is averaged across three test sets.
  • Figure 5: Improvement in model performance on select molecule-, atom-, and bond-level descriptors when using DimeNet++ (operating on random MMFF94 conformers) versus the 2D GNN developed by haas2024rapid, reported as the change in R² on the original test sets in haas2024rapid.
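
The captions above mention two computations that a short sketch may make concrete: aggregating per-conformer property values into an ensemble-level value (Boltzmann-averaging, or taking the min/max over conformers, as in Figures 1 and 2), and reporting a model's error as a relative increase in MAE over a reference model (Figures 3 and 4). The Python below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the kcal/mol energy units, the 298.15 K default temperature, and the exact form of the relative-increase formula are all hypothetical choices made here.

```python
import numpy as np

# Gas constant in kcal/(mol*K); assumes conformer energies are in kcal/mol.
R_KCAL_PER_MOL_K = 0.0019872041


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Boltzmann-weighted mean of per-conformer property values.

    Hypothetical helper: weights are exp(-dE / RT), normalized to sum to 1,
    with dE measured relative to the lowest-energy conformer.
    """
    values = np.asarray(values, dtype=float)
    energies = np.asarray(energies_kcal, dtype=float)
    rel_energies = energies - energies.min()  # shift for numerical stability
    weights = np.exp(-rel_energies / (R_KCAL_PER_MOL_K * temperature))
    weights /= weights.sum()
    return float(np.dot(weights, values))


def min_max_aggregate(values):
    """Ensemble-level (min, max) of a per-conformer property, e.g. Sterimol-B5.

    The conformer attaining the extreme value plays the role of the
    "active" conformer for that target.
    """
    values = np.asarray(values, dtype=float)
    return float(values.min()), float(values.max())


def relative_mae_increase(mae_model, mae_reference):
    """Fractional increase in MAE relative to a reference model.

    Assumed form: (MAE_model - MAE_ref) / MAE_ref, so 0.5 means 50% more error.
    """
    return (mae_model - mae_reference) / mae_reference


if __name__ == "__main__":
    # Made-up per-conformer values and energies, for illustration only.
    b5_values = [4.1, 5.3, 4.7]
    energies = [0.0, 1.2, 0.4]
    print(boltzmann_average(b5_values, energies))
    print(min_max_aggregate(b5_values))
    print(relative_mae_increase(0.30, 0.20))
```

Under these assumptions, the min/max aggregation depends on a single conformer of the ensemble, which is why the presence or absence of that active conformer in the encoded input matters for the comparisons in Figures 3 and 4.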