MolLIBRA: Genetic Molecular Optimization with Multi-Fingerprint Surrogates and Text-Molecule Aligned Critic

Masahi Okada, Kazuki Sakai, Hiroaki Yoshida, Masaki Okoshi, Tadahiro Taniguchi

TL;DR

MolLIBRA tackles sample-efficient molecular optimization under a limited oracle budget by integrating a multimodal pre-evaluation framework into a genetic algorithm. It combines an ensemble of Gaussian process surrogates, each operating on a different molecular fingerprint, with a text-molecule aligned CLAMP critic that supplies a zero-shot scoring signal, so that candidates are ranked before any costly oracle evaluation. The method adaptively selects among critics and draws on both structural fingerprints and natural-language task descriptions. MolLIBRA-L, the variant with a language-model-based candidate generator, achieves the best Top-10 AUC on 14 of 22 PMO-1K tasks and the highest summed Top-10 AUC across tasks. The results underscore the value of representation-robust, language-informed priors in low-data regimes for drug design and point to future work on richer critics and broader fingerprints.

Abstract

We study sample-efficient molecular optimization under a limited budget of oracle evaluations. We propose MolLIBRA (MultimOdaLity and Language Integrated Bayesian and evolutionaRy optimizAtion), a genetic-algorithm-based framework that pre-ranks candidate molecules using multiple critics before oracle calls: (i) an ensemble of Gaussian process (GP) surrogates defined over multiple molecular fingerprints and (ii) CLAMP, a pretrained text-molecule aligned encoder. The GP ensemble enables adaptive selection of task-appropriate fingerprints, while CLAMP provides a zero-shot scoring signal from task descriptions by measuring the similarity between molecular and text embeddings. On the Practical Molecular Optimization (PMO) benchmark with a budget of 1,000 evaluations (PMO-1K), MolLIBRA-L, our variant with a language-model-based candidate generator, attains the best Top-10 AUC on 14/22 tasks and the highest overall sum of Top-10 AUC across tasks compared with prior methods.
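
The zero-shot scoring idea described above (similarity between molecular and text embeddings) can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses hand-written toy vectors in place of real CLAMP embeddings; the function names `zero_shot_scores` and `pre_rank` are hypothetical and are not taken from the paper or from any CLAMP library.

```python
import numpy as np


def zero_shot_scores(mol_emb, text_emb):
    """Cosine similarity between each molecule embedding and one task-text embedding.

    Assumption: cosine similarity stands in for whatever similarity the
    pretrained encoder actually uses.
    """
    mol_emb = np.asarray(mol_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    mol_unit = mol_emb / np.linalg.norm(mol_emb, axis=1, keepdims=True)
    text_unit = text_emb / np.linalg.norm(text_emb)
    return mol_unit @ text_unit


def pre_rank(candidates, scores, k):
    """Keep the k highest-scoring candidates; only these would go to the oracle."""
    order = np.argsort(-np.asarray(scores))[:k]
    return [candidates[i] for i in order]


# Toy data: placeholder names and 2-D vectors, not real molecules or embeddings.
candidates = ["mol_a", "mol_b", "mol_c"]
mol_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([1.0, 0.0])

scores = zero_shot_scores(mol_emb, text_emb)
top = pre_rank(candidates, scores, k=2)
print(top)
```

In this toy setting the two candidates whose embeddings point most nearly along the task-text direction are kept, which is the sense in which a critic can rank candidates without spending any of the oracle budget.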

Paper Structure

This paper contains 37 sections, 5 equations, 5 figures, 9 tables, 5 algorithms.

Figures (5)

  • Figure 1: A conceptual illustration of MolLIBRA, a GA-based molecular optimization framework with multi-fingerprint surrogates and a text-molecule-aligned critic. MolLIBRA integrates two modalities for pre-evaluation: molecular fingerprints and natural-language task descriptions. The critics consist of learnable Gaussian process (GP) models defined over multiple fingerprints and a zero-shot critic based on a pretrained and frozen CLAMP model [Ramsauer2023CLAMP]. A critic is probabilistically selected for candidate ranking, and the selection probabilities are updated using the newly observed oracle scores.
  • Figure 2: Heatmap visualizing the contribution of critics (structured-space GPs and the CLAMP critic) in MolLIBRA-$\mathcal{L}$'s optimization process. The color intensity indicates the accumulated step-wise improvement in oracle scores realized by each critic. The contributions are normalized so that the total contribution of all critics sums to 100%. Similar results for MolLIBRA-$\mathcal{G}$ are provided in Appendix Figure 4.
  • Figure 3: Temporal evolution of contributions in four tasks (results from a single seed run). The cumulative score improvement realized by each critic is shown as an area chart.
  • Figure 4: Heatmap of critic contributions during the optimization process of MolLIBRA-$\mathcal{G}$. Compared to MolLIBRA-$\mathcal{L}$ shown in Figure 2, the dominant critics for each task are generally consistent.
  • Figure 5: Heatmaps showing the critic contributions for different seeds. While amlodipine_mpo shows consistent critic contributions across seeds, fexofenadine_mpo exhibits large variation depending on the seed.
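
The captions of Figures 1 and 2 describe two mechanisms: a critic is selected probabilistically with probabilities updated from new oracle scores, and each critic's accumulated step-wise improvement is normalized to sum to 100%. The sketch below illustrates both under explicit assumptions. The softmax-over-accumulated-gains update, the `temperature` parameter, and the `CriticSelector` class are hypothetical choices made for illustration; the excerpt does not state the update rule the authors use.

```python
import numpy as np


class CriticSelector:
    """Toy probabilistic selector over critics (illustrative, not the paper's rule)."""

    def __init__(self, names, temperature=1.0, seed=0):
        self.names = list(names)
        self.gains = np.zeros(len(self.names))  # accumulated improvement per critic
        self.temperature = temperature          # assumed hyperparameter
        self.rng = np.random.default_rng(seed)

    def probabilities(self):
        # Assumed rule: softmax over accumulated gains, shifted for stability.
        logits = self.gains / self.temperature
        weights = np.exp(logits - logits.max())
        return weights / weights.sum()

    def select(self):
        # Sample the index of the critic that will rank the next candidate batch.
        return int(self.rng.choice(len(self.names), p=self.probabilities()))

    def update(self, idx, best_before, best_after):
        # Credit the chosen critic with any improvement in the best oracle score.
        self.gains[idx] += max(0.0, best_after - best_before)

    def contributions_percent(self):
        # Normalize accumulated gains so that all critics sum to 100%.
        total = self.gains.sum()
        if total == 0:
            return np.zeros_like(self.gains)
        return 100.0 * self.gains / total


# Placeholder critic names: two fingerprint GPs and the CLAMP critic.
selector = CriticSelector(["gp_fingerprint_a", "gp_fingerprint_b", "clamp"])
selector.update(0, best_before=0.2, best_after=0.5)
selector.update(2, best_before=0.5, best_after=0.6)
probs = selector.probabilities()
shares = selector.contributions_percent()
chosen = selector.select()
print(selector.names[chosen], shares)
```

Crediting only positive changes in the best score keeps the accumulated gains non-negative, so the normalized shares can be read directly as percentages of the kind shown in a contribution heatmap.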