Can Molecular Foundation Models Know What They Don't Know? A Simple Remedy with Preference Optimization

Langzhou He, Junyou Zhu, Fangxin Wang, Junhua Liu, Haoyan Xu, Yue Zhao, Philip S. Yu, Qitian Wu

TL;DR

This work tackles the instability of molecular foundation models on out-of-distribution molecules, including chemical hallucination, by introducing Mole-PAIR. Mole-PAIR is a lightweight, post-training detector that plugs into frozen molecular encoders and optimizes a pairwise ranking objective derived from Bradley–Terry theory, with a temperature parameter β, to align training with AUROC. The authors provide theoretical guarantees showing convergence to Bayes-optimal ranking and demonstrate that the method prioritizes hard and borderline ID–OOD pairs during learning. Empirically, Mole-PAIR yields substantial improvements over a suite of baselines on DrugOOD and GOOD benchmarks across multiple distribution shifts, using only feature-based OOD scores from a lightweight head attached to fixed encoders. The results support the practical utility of a model-agnostic, ranking-based reliability enhancement for molecular discovery pipelines.

Abstract

Molecular foundation models are rapidly advancing scientific discovery, but their unreliability on out-of-distribution (OOD) samples severely limits their application in high-stakes domains such as drug discovery and protein design. A critical failure mode is chemical hallucination, where models make high-confidence yet entirely incorrect predictions for unknown molecules. To address this challenge, we introduce Molecular Preference-Aligned Instance Ranking (Mole-PAIR), a simple, plug-and-play module that can be flexibly integrated with existing foundation models to improve their reliability on OOD data through cost-effective post-training. Specifically, our method formulates the OOD detection problem as a preference optimization over the estimated OOD affinity between in-distribution (ID) and OOD samples, achieving this goal through a pairwise learning objective. We show that this objective essentially optimizes AUROC, which measures how consistently ID and OOD samples are ranked by the model. Extensive experiments across five real-world molecular datasets demonstrate that our approach significantly improves the OOD detection capabilities of existing molecular foundation models, achieving up to 45.8%, 43.9%, and 24.3% improvements in AUROC under distribution shifts of size, scaffold, and assay, respectively.
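
The link between a pairwise objective and AUROC can be illustrated with a minimal NumPy sketch. It assumes a Bradley–Terry style logistic loss on the score gap with temperature `beta`, and the convention that OOD samples should receive the higher OOD-affinity score. The function names `pairwise_logistic_loss` and `pairwise_auroc` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def pairwise_logistic_loss(id_scores, ood_scores, beta=0.1):
    """Mean of -log sigmoid(beta * (s_ood - s_id)) over all ID-OOD pairs.

    Assumed form of a Bradley-Terry style pairwise loss; the paper's exact
    objective (e.g. its regularization term) may differ.
    """
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # margins[i, j] = score of OOD sample i minus score of ID sample j
    margins = ood_scores[:, None] - id_scores[None, :]
    # -log sigmoid(u) = log(1 + exp(-u)), computed stably via logaddexp
    return float(np.mean(np.logaddexp(0.0, -beta * margins)))

def pairwise_auroc(id_scores, ood_scores):
    """AUROC as the fraction of ID-OOD pairs ranked correctly (ties count 1/2)."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    margins = ood_scores[:, None] - id_scores[None, :]
    return float(np.mean((margins > 0) + 0.5 * (margins == 0)))

# Toy scores: one OOD sample (0.3) is ranked below two ID samples.
id_scores = [0.1, 0.4, 0.35]
ood_scores = [0.8, 0.3, 0.9]
print(pairwise_auroc(id_scores, ood_scores))          # 7 of 9 pairs correct
print(pairwise_logistic_loss(id_scores, ood_scores))  # smooth surrogate of the same ranking
```

Both functions depend on the scores only through the pairwise margins, which is why minimizing the logistic loss pushes in the same direction as increasing AUROC, whereas a per-sample calibration loss has no such direct tie to the ranking.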

Paper Structure

This paper contains 56 sections, 3 theorems, 38 equations, 6 figures, 5 tables.

Key Result

Proposition 4.1

Let $d_\phi=\nabla_\phi E_\phi(S_{\mathrm{out}})-\nabla_\phi E_\phi(S_{\mathrm{in}})$, and let $\Delta E_\phi=E_\phi(S_{\mathrm{out}})-E_\phi(S_{\mathrm{in}})$ denote the margin. A gradient step of size $\eta>0$ on the pairwise preference loss changes the margin by an amount scaled by the weight $\sigma(-\beta\,\Delta E_\phi)$, where $\sigma(u)=(1+e^{-u})^{-1}$. This weight decreases with $\Delta E_\phi$, so it is largest for misranked or borderline pairs and smallest for pairs that are already well separated. The detail...
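
The hard-pair weighting can be made concrete with a short first-order calculation. This is a sketch under an assumed per-pair loss $\ell(\phi)=-\log\sigma(\beta\,\Delta E_\phi)$ with $\Delta E_\phi=E_\phi(S_{\mathrm{out}})-E_\phi(S_{\mathrm{in}})$ and no regularization term; it is not a transcription of the paper's own statement, which may include additional terms. Since $\frac{d}{du}\left[-\log\sigma(u)\right]=-\sigma(-u)$ and $\nabla_\phi\,\Delta E_\phi=d_\phi$,

$$
\nabla_\phi \ell(\phi) = -\beta\,\sigma(-\beta\,\Delta E_\phi)\,d_\phi,
\qquad
\phi^{+} = \phi-\eta\,\nabla_\phi \ell(\phi) = \phi+\eta\,\beta\,\sigma(-\beta\,\Delta E_\phi)\,d_\phi .
$$

A first-order Taylor expansion of the margin around $\phi$ then gives

$$
\Delta E_{\phi^{+}}-\Delta E_\phi
\approx d_\phi^{\top}(\phi^{+}-\phi)
= \eta\,\beta\,\sigma(-\beta\,\Delta E_\phi)\,\lVert d_\phi\rVert_2^{2}\;\ge 0 .
$$

Under these assumptions, a misranked pair with $\Delta E_\phi\ll 0$ has $\sigma(-\beta\,\Delta E_\phi)\approx 1$ and receives nearly the full update, a borderline pair with $\Delta E_\phi\approx 0$ receives about half of it, and a well-separated pair with $\Delta E_\phi\gg 0$ receives almost none.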

Figures (6)

  • Figure 1: A case study illustrating the objective-metric misalignment. The figure plots the estimated OOD affinity scores yielded by the model trained with different objectives for ID and OOD samples on the IC50-Scaffold task. The Pairwise-Hinge loss (Joachims, 2002) produces a globally separated score distribution between ID and OOD, aligning with AUROC, whereas the two pointwise objectives yield heavily overlapping scores due to their per-sample calibration loss. This clearly demonstrates the importance of objective-metric alignment for OOD detection.
  • Figure 2: Overview of the Mole-PAIR framework.
  • Figure 3: Test AUROC sensitivity to the temperature $\beta$ with $\lambda=0.01$. Different distribution shifts show distinct sensitivities: Assay prefers a medium $\beta$, Scaffold favors a larger $\beta$, while Size is largely insensitive to the choice of $\beta$.
  • Figure 4: Test AUROC sensitivity to the $\ell_2$ regularization $\lambda$ with $\beta=0.1$. Performance varies with regularization strength: Assay performs best with weak regularization, Scaffold benefits from a modest amount of regularization, while Size is robust until the regularization becomes too strong.
  • Figure 5: Training dynamics of Mole-PAIR across three distribution shifts. Each panel corresponds to one shift—(a) Assay, (b) Scaffold, (c) Size—and plots three metrics over 20 epochs: the misranked-pair proportion $\Pr(\Delta E_\phi<0)$ (left $y$-axis), the boundary mass $\Pr(|\Delta E_\phi|<\varepsilon)$ with $\varepsilon=0.05$ (left $y$-axis), and the average margin $\mathbb{E}[\Delta E_\phi]$ (right $y$-axis), where $\Delta E_\phi=E_\phi(S_{\mathrm{out}})-E_\phi(S_{\mathrm{in}})$. The rapid decrease of the first two curves and the steady increase of the margin illustrate that hard or borderline pairs are corrected first.
  • ...and 1 more figures
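
The three quantities tracked in Figure 5 can be computed from the energies of ID and OOD samples, as in the minimal NumPy sketch below. It assumes that every ID–OOD pair is enumerated (the paper may instead sample pairs), and the function name `pair_dynamics` and its dictionary keys are hypothetical.

```python
import numpy as np

def pair_dynamics(e_in, e_out, eps=0.05):
    """Summarize margins Delta E = E(S_out) - E(S_in) over all ID-OOD pairs.

    Assumption: pairs are formed exhaustively from the two score arrays.
    """
    e_in = np.asarray(e_in, dtype=float)
    e_out = np.asarray(e_out, dtype=float)
    # delta[i, j] = energy of OOD sample i minus energy of ID sample j
    delta = e_out[:, None] - e_in[None, :]
    return {
        "misranked": float(np.mean(delta < 0)),          # Pr(Delta E < 0)
        "boundary": float(np.mean(np.abs(delta) < eps)), # Pr(|Delta E| < eps)
        "avg_margin": float(np.mean(delta)),             # E[Delta E]
    }

# Toy energies: one pair (0.22 vs 0.2) sits inside the eps = 0.05 boundary band.
stats = pair_dynamics(e_in=[0.0, 0.2], e_out=[0.22, 1.0])
print(stats)
```

Logging these values once per epoch would give curves of the kind described in the caption: the first two should fall and the average margin should rise if hard and borderline pairs are being corrected.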

Theorems & Definitions (3)

  • Proposition 4.1: Hard-pair emphasis
  • Lemma 4.2: Local Pairwise Optimality
  • Proposition 4.3: Global Convergence to the Bayes-Optimal Ranking