Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition

Bo Pan, Peter Zhiping Zhang, Hao-Wei Pang, Alex Zhu, Xiang Yu, Liying Zhang, Liang Zhao

TL;DR

The paper reframes medicinal-chemistry analog design around matched molecular pair transformations (MMPTs) and introduces MMPT-FM, a promptable foundation model trained on large-scale MMPT data to generate context-appropriate variable substitutions. It further couples this with MMPT-RAG, a retrieval-augmented generation framework that uses external reference analogs, cluster templates, and MCS-based prompts to steer generation toward project-relevant transformation patterns. Across in-distribution, within-patent, and cross-patent benchmarks, the approach improves recall of ground-truth MMPTs and maintains high validity while boosting novelty, demonstrating effective transformation-level priors and practical usefulness for discovery. The work offers a scalable, interpretable abstraction for controlled analog design, enabling chemists to leverage large-scale data without retraining while aligning outputs with specific series or patents.

Abstract

Matched molecular pairs (MMPs) capture the local chemical edits that medicinal chemists routinely use to design analogs, but existing ML approaches either operate at the whole-molecule level with limited edit controllability or learn MMP-style edits from restricted settings and small models. We propose a variable-to-variable formulation of analog generation and train a foundation model on large-scale MMP transformations (MMPTs) to generate diverse variables conditioned on an input variable. To enable practical control, we develop prompting mechanisms that let users specify preferred transformation patterns during generation. We further introduce MMPT-RAG, a retrieval-augmented framework that uses external reference analogs as contextual guidance to steer generation and generalize from project-specific series. Experiments on general chemical corpora and patent-specific datasets demonstrate improved diversity, novelty, and controllability, and show that our method recovers realistic analog structures in practical discovery scenarios.
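To make the idea of template prompting from a cluster of reference variables more concrete, here is a minimal sketch in Python. It is not the paper's implementation: it replaces MCS-based template extraction with a much cruder string-level stand-in (the shared prefix and suffix of the variable SMILES), and the `<mask>` token, the `[*:1]` attachment-point notation, and all function names are hypothetical choices made for illustration.

```python
import os
from typing import List

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def extract_template(variables: List[str]) -> str:
    """Build a masked template from one cluster of variable strings.

    Assumption: the shared prefix and suffix of the strings stand in for
    the common substructure that an MCS routine would find on the graphs.
    """
    if not variables:
        raise ValueError("cluster must contain at least one variable")
    prefix = os.path.commonprefix(variables)
    # Compute the suffix on the remainders so it cannot overlap the prefix.
    remainders = [v[len(prefix):] for v in variables]
    suffix = os.path.commonprefix([r[::-1] for r in remainders])[::-1]
    return f"{prefix}{MASK}{suffix}"


def fill_template(template: str, infill: str) -> str:
    """Substitute a generated infill for the mask (a model would propose it)."""
    prefix, _, suffix = template.partition(MASK)
    return f"{prefix}{infill}{suffix}"


def matches_template(candidate: str, template: str) -> bool:
    """Check that a candidate keeps the fixed parts of the template."""
    prefix, _, suffix = template.partition(MASK)
    return (
        len(candidate) >= len(prefix) + len(suffix)
        and candidate.startswith(prefix)
        and candidate.endswith(suffix)
    )


# Illustrative cluster of para-substituted phenyl variables (made-up data).
cluster = ["[*:1]c1ccc(F)cc1", "[*:1]c1ccc(Cl)cc1", "[*:1]c1ccc(OC)cc1"]
template = extract_template(cluster)
candidate = fill_template(template, "Br")
```

For this cluster the template comes out as `[*:1]c1ccc(<mask>)cc1`, so any infill keeps the phenyl scaffold and varies only the para substituent. A string-level template can cut through multi-character SMILES tokens such as `Cl`, which is one reason a graph-level MCS is the more natural choice in practice.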

Paper Structure (34 sections, 1 theorem, 8 equations, 7 figures, 3 tables)

This paper contains 34 sections, 1 theorem, 8 equations, 7 figures, 3 tables.

Key Result

Theorem 1

Let $p_{\theta}(y \mid x)$ be the conditional distribution over variables $y \in \mathcal{V}$ defined by the unconstrained foundation model. Assume that for each cluster $k$, prompting the model with template $T_k$ (via masked infilling) results in a local distribution $p(y \mid x, T_k)$ that is an where $\alpha_k \in (0,1]$ is an adaptive gating factor reflecting the model's adherence to templat

Figures (7)

  • Figure 1: An example of (a) Matched Molecular Pairs (MMP); (b) Matched Molecular Pair Transformation (MMPT) and its textual representation.
  • Figure 2: Overview of the proposed MMPT framework. (a) The foundation model (MMPT-FM) is trained on large-scale MMPT data. (b) MMPT-FM supports controllable generation via masked template prompting. (c) MMPT-RAG augments generation with retrieval, clustering, and MCS-based template extraction to guide context-aware transformation generation.
  • Figure 3: Visualizations of the chemical space explored by our foundation model MMPT-FM (blue) versus Database Retrieval (red) on (a) ChEMBL and (b) PMV17 datasets.
  • Figure 4: UMAP visualization of the chemical landscape covered by MMPT-FM and MMPT-RAG on PMV17. The grey shaded areas represent the reference dataset's distribution. Compared to FM inference (blue), the MMPT-RAG framework (red) populates structural voids where the foundation model is sparse or absent.
  • Figure 5: Hyperparameter Study. (a) the number of clusters to expand, (b) the number of variables to generate for each cluster, (c) the range of mask length to fill.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1: Global Steering