Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition
Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao
TL;DR
The paper reframes medicinal-chemistry analog design around matched molecular pair transformations (MMPTs) and introduces MMPT-FM, a promptable foundation model trained on large-scale MMPT data to generate context-appropriate variable substitutions. It further couples this with MMPT-RAG, a retrieval-augmented generation framework that uses external reference analogs, cluster templates, and MCS-based prompts to steer generation toward project-relevant transformation patterns. Across in-distribution, within-patent, and cross-patent benchmarks, the approach improves recall of ground-truth MMPTs and maintains high validity while boosting novelty, demonstrating effective transformation-level priors and practical usefulness for discovery. The work offers a scalable, interpretable abstraction for controlled analog design, enabling chemists to leverage large-scale data without retraining while aligning outputs with specific series or patents.
Abstract
Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited control over individual edits or learn MMP-style edits in restricted settings with small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
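To make the variable-to-variable formulation concrete, the following is a minimal sketch, not the paper's model. It assumes each matched pair is stored as a shared constant part plus two variable fragments, and it uses a simple frequency lookup as a stand-in for the learned generator. The class and function names, the toy pairs, and the attachment-point fragment strings are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchedPair:
    """A matched molecular pair: one shared constant part, two variable fragments.

    Fragments are plain strings with an attachment point written as [*:1].
    This notation is an illustrative assumption, not the paper's data format.
    """
    constant: str
    variable_a: str
    variable_b: str

    def transformation(self):
        """Return the variable-to-variable edit, dropping the constant part."""
        return (self.variable_a, self.variable_b)


def build_transformation_table(pairs):
    """Count how often each source variable is replaced by each target variable."""
    table = defaultdict(Counter)
    for pair in pairs:
        src, dst = pair.transformation()
        table[src][dst] += 1
    return table


def suggest_variables(table, query_variable, k=3):
    """Return up to k replacement variables for the query, most frequent first.

    A frequency lookup stands in here for the generative model (assumption).
    """
    counts = table.get(query_variable, Counter())
    return [dst for dst, _ in counts.most_common(k)]


# Toy pairs (hypothetical): methyl -> ethyl seen twice, methyl -> fluoro once.
pairs = [
    MatchedPair("c1ccccc1[*:1]", "[*:1]C", "[*:1]CC"),
    MatchedPair("c1ccncc1[*:1]", "[*:1]C", "[*:1]CC"),
    MatchedPair("c1ccccc1[*:1]", "[*:1]C", "[*:1]F"),
    MatchedPair("c1ccccc1[*:1]", "[*:1]Cl", "[*:1]F"),
]

table = build_transformation_table(pairs)
suggestions = suggest_variables(table, "[*:1]C")
print(suggestions)
```

Because the table is keyed on the variable alone, the same edit observed on different constant parts is pooled. That pooling is the sense in which a transformation-level view abstracts away from whole molecules; the model described above replaces the lookup with conditional generation.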
