The impact of conformer quality on learned representations of molecular conformer ensembles

Keir Adams; Connor W. Coley

TL;DR

This work investigates how the geometric quality of input conformers affects 3D ML surrogate models that predict properties of high-quality conformer ensembles. Using 14 3D GNN models (based on DimeNet++) to predict ensemble-level Sterimol descriptors for carboxylic acids, it systematically varies the quality of the input conformers (DFT, xTB, MMFF94) and the encoding strategy (single active conformer, random conformer, augmented, decoy sets). The main finding is that, for ensemble-level Sterimol properties, surrogates trained on high-quality ensembles rarely outperform simply computing the descriptors from cheap ensembles, and set-based encodings offer limited gains. Performance deteriorates when the active conformer is absent or degraded, though data augmentation improves robustness. These results provide practical guidance: for high-throughput scenarios where the target depends on an active conformer, it can be more cost-effective to rely on cheap conformer ensembles than to train costly 3D surrogates, though 3D models may offer selective benefits for geometry-sensitive descriptors and more robust learning in some contexts.

Abstract

Training machine learning models to predict properties of molecular conformer ensembles is an increasingly popular strategy to accelerate the conformational analysis of drug-like small molecules, reactive organic substrates, and homogeneous catalysts. For high-throughput analyses especially, trained surrogate models can help circumvent traditional approaches to conformational analysis that rely on expensive conformer searches and geometry optimizations. Here, we question how the performance of surrogate models for predicting 3D conformer-dependent properties (of a single, active conformer) is affected by the quality of the 3D conformers used as their input. How well do lower-quality conformers inform the prediction of properties of higher-quality conformers? Does the fidelity of geometry optimization matter when encoding random conformers? For models that encode sets of conformers, how does the presence of the active conformer that induces the target property affect model accuracy? How do predictions from a surrogate model compare to estimating the properties from cheap ensembles themselves? We explore these questions in the context of predicting Sterimol parameters of conformer ensembles optimized with density functional theory. Although answers will be case-specific, our analyses provide a valuable perspective on 3D representation learning models and raise practical considerations regarding when conformer quality matters.
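The last question in the abstract, comparing a surrogate's predictions with properties estimated directly from cheap ensembles, can be made concrete with a small sketch. The abstract does not say how per-conformer Sterimol values are aggregated into ensemble-level targets, so the Boltzmann-weighted, minimum, maximum, and lowest-energy aggregations below are assumptions. The function names (`boltzmann_weights`, `ensemble_descriptors`, `mean_absolute_error`) and the temperature of 298.15 K are hypothetical choices made for illustration, not the authors' implementation.

```python
import math

# Assumption: energies in kcal/mol and T = 298.15 K; the source does not specify either.
KT_KCAL_PER_MOL = 0.0019872041 * 298.15  # gas constant (kcal/mol/K) times temperature


def boltzmann_weights(energies_kcal, kt=KT_KCAL_PER_MOL):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    e_min = min(energies_kcal)  # shift by the minimum for numerical stability
    factors = [math.exp(-(e - e_min) / kt) for e in energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def ensemble_descriptors(energies_kcal, values):
    """Aggregate one per-conformer descriptor (e.g. a Sterimol parameter)
    into ensemble-level quantities. The four aggregations are assumed."""
    weights = boltzmann_weights(energies_kcal)
    lowest = min(range(len(values)), key=lambda i: energies_kcal[i])
    return {
        "boltzmann": sum(w * v for w, v in zip(weights, values)),
        "min": min(values),
        "max": max(values),
        "lowest_energy": values[lowest],
    }


def mean_absolute_error(predicted, reference):
    """Mean absolute error between two equally long sequences of values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


if __name__ == "__main__":
    # Toy placeholder numbers, not data from the paper.
    cheap = ensemble_descriptors([0.0, 0.4, 1.5], [6.1, 6.8, 7.4])
    reference = ensemble_descriptors([0.0, 0.9, 1.1], [6.0, 6.9, 7.2])
    keys = sorted(cheap)
    error = mean_absolute_error([cheap[k] for k in keys], [reference[k] for k in keys])
    print(f"MAE between cheap and reference ensemble descriptors: {error:.3f}")
```

Under these assumptions, the same `mean_absolute_error` call could score a trained surrogate's predictions against the reference targets, which gives a like-for-like comparison with the cheap-ensemble baseline.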
