Benchmarking Pretrained Molecular Embedding Models For Molecular Representation Learning

Mateusz Praski; Jakub Adamczyk; Wojciech Czech

TL;DR

This study tests the assumed progress in pretrained molecular embeddings by benchmarking 25 models across 25 diverse datasets, using a fair, fixed-embedding protocol and a Bayesian Bradley-Terry framework. The analysis shows that, apart from CLAMP, which is itself fingerprint-based, most neural representations fail to outperform the strong ECFP fingerprint baseline, and several transformer-based approaches reach only practical equivalence at substantial computational cost. By enforcing rigorous evaluation and uncertainty-aware ranking via a region of practical equivalence (ROPE), the work highlights the need for domain-specific inductive biases and robust baselines in molecular representation learning. The findings bear directly on model selection in drug discovery and chemoinformatics: researchers should prioritize reproducible benchmarks and cost-efficient baselines while continuing to refine representation learning with targeted, chemistry-informed improvements.

Abstract

Pretrained neural networks have attracted significant interest in chemistry and small molecule drug design. Embeddings from these models are widely used for molecular property prediction, virtual screening, and small data learning in molecular chemistry. This study presents the most extensive comparison of such models to date, evaluating 25 models across 25 datasets. Under a fair comparison framework, we assess models spanning various modalities, architectures, and pretraining strategies. Using a dedicated hierarchical Bayesian statistical testing model, we arrive at a surprising result: nearly all neural models show negligible or no improvement over the baseline ECFP molecular fingerprint. Only the CLAMP model, which is also based on molecular fingerprints, performs statistically significantly better than the alternatives. These findings raise concerns about the evaluation rigor in existing studies. We discuss potential causes, propose solutions, and offer practical recommendations.
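
To give a rough sense of the kind of ranking model a Bradley-Terry framework builds on, the sketch below fits plain (non-Bayesian, non-hierarchical) Bradley-Terry strengths from pairwise win counts with a standard minorization-maximization update. It is not the authors' model: the hierarchical Bayesian structure and the ROPE handling of near-ties are omitted, and the win counts, the model labels, and the helper names `fit_bradley_terry` and `win_probability` are hypothetical, chosen only for illustration.

```python
# Minimal sketch: maximum-likelihood Bradley-Terry strengths via the MM update.
# Assumption: wins[i][j] counts how often model i beat model j on some dataset
# comparison. The counts below are made up and are not results from the study.

def fit_bradley_terry(wins, n_iter=500, tol=1e-12):
    """Return strengths p (summing to 1) with P(i beats j) = p[i] / (p[i] + p[j])."""
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]  # fix the scale: strengths sum to 1
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_probability(p, i, j):
    """Probability that item i beats item j under the fitted strengths."""
    return p[i] / (p[i] + p[j])


# Hypothetical round-robin: three placeholder models, ten comparisons per pair.
labels = ["model_a", "model_b", "model_c"]
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(zip(labels, strengths), key=lambda pair: pair[1], reverse=True)
for name, strength in ranking:
    print(f"{name}: {strength:.3f}")
```

A Bayesian treatment would place priors on the strengths and report posterior probabilities that one model is better than, worse than, or practically equivalent to another, rather than the single point estimate computed here.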
