Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks

Yanqiao Zhu; Jeehyun Hwang; Keir Adams; Zhen Liu; Bozhao Nan; Brock Stenfors; Yuanqi Du; Jatin Chauhan; Olaf Wiest; Olexandr Isayev; Connor W. Coley; Yizhou Sun; Wei Wang

TL;DR

The paper introduces MARCEL, a benchmark for learning from conformer ensembles that addresses molecular flexibility in molecular representation learning (MRL). It formalizes Boltzmann-averaged targets over conformer ensembles and evaluates 1D, 2D, and 3D representations, plus two ensemble strategies that explicitly incorporate multiple conformers. Across four diverse datasets and nine regression tasks, conformer-ensemble methods generally improve performance. DeepSets-based set encoders and training-time conformer sampling offer notable gains on several tasks, though the gains are not universal and depend on data size and task. The work highlights both the practical potential and the current limitations of conformer-ensemble learning, and calls for more efficient architectures and sampling strategies that better exploit conformational diversity in real-world catalytic and chemical design tasks.
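To make the notion of a Boltzmann-averaged target concrete, the following is a minimal sketch using the standard statistical-mechanics definition, in which conformer i with relative energy E_i receives weight p_i = exp(-E_i / RT) / sum_j exp(-E_j / RT). The paper's own equation is not reproduced on this page, so the units (kcal/mol), the default temperature of 298.15 K, and the function names `boltzmann_weights` and `boltzmann_average` are assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K); the energy unit is an assumption of this sketch.
R_KCAL = 0.0019872041


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann weights for a list of conformer energies."""
    rt = R_KCAL * temperature
    # Shift by the minimum energy for numerical stability; weights are unchanged.
    e_min = min(energies_kcal)
    factors = [math.exp(-(e - e_min) / rt) for e in energies_kcal]
    partition = sum(factors)
    return [f / partition for f in factors]


def boltzmann_average(values, energies_kcal, temperature=298.15):
    """Average a per-conformer property using Boltzmann weights."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * v for w, v in zip(weights, values))
```

Low-energy conformers dominate the average, and conformers with equal energies contribute equally, so the result reduces to a plain mean when all energies coincide.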

Abstract

Molecular Representation Learning (MRL) has proven impactful in numerous biochemical applications such as drug discovery and enzyme design. While Graph Neural Networks (GNNs) are effective at learning molecular representations from a 2D molecular graph or a single 3D structure, existing works often overlook the flexible nature of molecules, which continuously interconvert across conformations via chemical bond rotations and minor vibrational perturbations. To better account for molecular flexibility, some recent works formulate MRL as an ensemble learning problem, focusing on explicitly learning from a set of conformer structures. However, most of these studies have limited datasets, tasks, and models. In this work, we introduce the first MoleculAR Conformer Ensemble Learning (MARCEL) benchmark to thoroughly evaluate the potential of learning on conformer ensembles and suggest promising research directions. MARCEL includes four datasets covering diverse molecule- and reaction-level properties of chemically diverse molecules including organocatalysts and transition-metal catalysts, extending beyond the scope of common GNN benchmarks that are confined to drug-like molecules. In addition, we conduct a comprehensive empirical study, which benchmarks representative 1D, 2D, and 3D molecular representation learning models, along with two strategies that explicitly incorporate conformer ensembles into 3D MRL models. Our findings reveal that direct learning from an accessible conformer space can improve performance on a variety of tasks and models.

Paper Structure (25 sections, 1 equation, 4 figures, 5 tables)

Introduction
Problem Formulation
Datasets and Tasks
Benchmarking Molecular Representation Learning Models
1D Models
2D Graph Neural Networks
3D Graph Neural Networks
Incorporating Conformer Ensembles into Molecular Representations
Strategy 1: Training-Time Data Augmentation via Conformer Sampling
Strategy 2: Ensemble Learning with Explicit Set Encoders
Experiments
Experimental Configurations
Results and Analysis
Discussions and Conclusions
Dataset Description
...and 10 more sections
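The outline names Strategy 1 as training-time data augmentation via conformer sampling, but this page gives no further detail. One plausible reading, sketched below, is that each epoch draws a single conformer per molecule so that a standard single-structure 3D model sees different conformers over the course of training. Uniform sampling, the dictionary layout of `ensembles`, and the function name `sample_epoch` are assumptions for illustration, not the authors' implementation.

```python
import random


def sample_epoch(ensembles, rng):
    """Draw one conformer per molecule for a single training epoch.

    `ensembles` maps a molecule id to (list_of_conformers, target).
    Conformers are opaque objects here (e.g. coordinate arrays).
    Sampling is uniform, which is an assumption of this sketch.
    """
    batch = []
    for mol_id, (conformers, target) in ensembles.items():
        conformer = rng.choice(conformers)  # a fresh draw every epoch
        batch.append((mol_id, conformer, target))
    return batch


# Hypothetical usage: two molecules with string placeholders for conformers.
if __name__ == "__main__":
    toy = {"mol_a": (["a0", "a1", "a2"], 1.5), "mol_b": (["b0"], -0.2)}
    rng = random.Random(0)
    for epoch in range(3):
        print(epoch, sample_epoch(toy, rng))
```

Because every conformer of a molecule is paired with the same molecule-level target, this scheme leaves the model architecture unchanged and only alters which structure is fed in at each step.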

Figures (4)

Figure 1: We present a MARCEL benchmark that comprehensively evaluates the potential of learning on conformer ensembles across a diverse set of molecules, datasets, and models.
Figure 2: Conformer ensemble learning with explicit set encoders (Strategy 2). Individual conformer embeddings are first obtained via 3D GNN encoders. Then, a set encoder is employed to aggregate conformer embeddings. Finally, a linear projection head is used to generate the prediction.
Figure 3: Performance changes of four conformer ensemble learning strategies on the basis of six 3D graph models. Here, negative values denote reduced Mean Absolute Error (MAE), signifying a performance improvement due to the incorporation of conformer ensembles.
Figure S1: Histogram of the ratio of the variance of each conformer property to the variance of each Boltzmann-averaged property in the Kraken dataset.
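The pipeline in the Figure 2 caption (per-conformer embeddings, a set encoder, then a linear projection head) can be sketched with a DeepSets-style aggregator, the set encoder named in the TL;DR. This is a dependency-free illustration only: the 3D GNN encoder is replaced by precomputed embedding vectors, mean pooling is assumed, and all names (`init_linear`, `deepsets_predict`, `phi`, `rho`, `head`) are hypothetical.

```python
import random


def init_linear(n_in, n_out, rng):
    """Create a small random linear layer as (weight rows, bias)."""
    weight = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weight, bias


def linear(x, weight, bias):
    """Apply y = W x + b to a plain Python vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def deepsets_predict(conformer_embeddings, phi, rho, head):
    """Map a set of conformer embeddings to one scalar prediction."""
    # 1) Transform every conformer embedding independently with phi.
    transformed = [relu(linear(e, *phi)) for e in conformer_embeddings]
    # 2) Pool over the set; mean pooling makes the result order-independent.
    n = len(transformed)
    pooled = [sum(column) / n for column in zip(*transformed)]
    # 3) Post-process the pooled vector with rho, then project to a scalar.
    hidden = relu(linear(pooled, *rho))
    return linear(hidden, *head)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    phi, rho, head = init_linear(4, 8, rng), init_linear(8, 8, rng), init_linear(8, 1, rng)
    # Stand-ins for the outputs of a 3D GNN encoder on five conformers.
    embeddings = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
    print(deepsets_predict(embeddings, phi, rho, head))
```

The key property is permutation invariance: reordering the conformers leaves the prediction unchanged up to floating-point rounding, which matches the fact that a conformer ensemble is an unordered set.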
