Can Molecular Foundation Models Know What They Don't Know? A Simple Remedy with Preference Optimization
Langzhou He, Junyou Zhu, Fangxin Wang, Junhua Liu, Haoyan Xu, Yue Zhao, Philip S. Yu, Qitian Wu
TL;DR
This work addresses the unreliability of molecular foundation models on out-of-distribution (OOD) molecules, including chemical hallucination (high-confidence but incorrect predictions), by introducing Mole-PAIR. Mole-PAIR is a lightweight, post-training detector that attaches a small scoring head to a frozen molecular encoder and trains it with a pairwise ranking objective derived from the Bradley–Terry model, with a temperature parameter β, so that training is aligned with AUROC. The authors provide theoretical guarantees of convergence to the Bayes-optimal ranking and show that the objective prioritizes hard and borderline pairs of in-distribution (ID) and OOD samples during learning. Empirically, Mole-PAIR improves substantially over a suite of baselines on the DrugOOD and GOOD benchmarks across multiple distribution shifts, using only feature-based OOD scores from the lightweight head. The results support the practical utility of a model-agnostic, ranking-based reliability enhancement for molecular discovery pipelines.
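To make the ranking objective concrete, the block below writes out one common Bradley–Terry pairwise loss and its gradient with respect to the score gap. It is a sketch under stated assumptions, not necessarily the paper's exact formulation: the symbols s_θ (the scoring head), x⁺ (an OOD sample), x⁻ (an ID sample), and Δ (the score gap) are introduced here for illustration, and a higher score is assumed to mean "more OOD".

```latex
% Assumed form: Bradley--Terry preference that the OOD sample x^+ outranks the ID sample x^-
\[
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{x^{+}\sim\mathcal{D}_{\mathrm{OOD}},\; x^{-}\sim\mathcal{D}_{\mathrm{ID}}}
    \bigl[\log \sigma(\beta\,\Delta_\theta)\bigr],
\qquad
\Delta_\theta = s_\theta(x^{+}) - s_\theta(x^{-}),
\qquad
\sigma(z) = \frac{1}{1+e^{-z}} .
\]
% Per-pair gradient with respect to the score gap
\[
\frac{\partial}{\partial \Delta}\bigl[-\log \sigma(\beta\Delta)\bigr]
  = -\,\beta\,\sigma(-\beta\Delta).
\]
```

Under this assumed form, the per-pair weight σ(−βΔ) approaches 1 for misranked pairs (Δ much less than 0), equals 1/2 for borderline pairs (Δ = 0), and vanishes for well-separated pairs (Δ much greater than 0). This is consistent with the claim that learning concentrates on hard and borderline ID–OOD pairs, and a larger β sharpens that concentration around Δ = 0.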
Abstract
Molecular foundation models are rapidly advancing scientific discovery, but their unreliability on out-of-distribution (OOD) samples severely limits their application in high-stakes domains such as drug discovery and protein design. A critical failure mode is chemical hallucination, where models make high-confidence yet entirely incorrect predictions for unknown molecules. To address this challenge, we introduce Molecular Preference-Aligned Instance Ranking (Mole-PAIR), a simple, plug-and-play module that can be flexibly integrated with existing foundation models to improve their reliability on OOD data through cost-effective post-training. Specifically, our method formulates OOD detection as preference optimization over the estimated OOD affinity of in-distribution (ID) and OOD samples, implemented as a pairwise learning objective. We show that this objective essentially optimizes AUROC, which measures how consistently the model ranks ID and OOD samples in the correct relative order. Extensive experiments across five real-world molecular datasets demonstrate that our approach significantly improves the OOD detection capabilities of existing molecular foundation models, achieving up to 45.8%, 43.9%, and 24.3% improvements in AUROC under distribution shifts of size, scaffold, and assay, respectively.
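The link between a pairwise objective and AUROC can be illustrated numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `pairwise_auroc` and `pairwise_logistic_loss`, the example scores, and the convention that a higher score means "more OOD" are all assumptions made for illustration. Empirical AUROC is the fraction of (ID, OOD) pairs ranked in the correct order, and the logistic loss is a smooth surrogate for that pair count.

```python
import numpy as np


def pairwise_auroc(id_scores, ood_scores):
    """Empirical AUROC: fraction of (ID, OOD) pairs in which the OOD sample
    receives the higher score. Ties count as half a correct pair."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    # diff[i, j] = score of OOD sample i minus score of ID sample j
    diff = ood_scores[:, None] - id_scores[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def pairwise_logistic_loss(id_scores, ood_scores, beta=1.0):
    """Bradley-Terry style surrogate (assumed form): mean of
    -log(sigmoid(beta * diff)) over all (ID, OOD) pairs."""
    id_scores = np.asarray(id_scores, dtype=float)
    ood_scores = np.asarray(ood_scores, dtype=float)
    diff = ood_scores[:, None] - id_scores[None, :]
    # -log(sigmoid(z)) = log(1 + exp(-z)), computed stably with logaddexp
    return float(np.mean(np.logaddexp(0.0, -beta * diff)))


# Hypothetical scores from a scoring head on top of a frozen encoder
id_scores = [0.10, 0.40, 0.35]
ood_scores = [0.80, 0.30, 0.90]

print("AUROC:", pairwise_auroc(id_scores, ood_scores))
print("loss :", pairwise_logistic_loss(id_scores, ood_scores, beta=5.0))
```

In this sketch, raising the scores of OOD samples relative to ID samples lowers the loss and increases the number of correctly ordered pairs, which is the sense in which a pairwise objective of this kind tracks AUROC.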
