Test-Time Training Scaling Laws for Chemical Exploration in Drug Design

Morgan Thomas, Albert Bou, Gianni De Fabritiis

TL;DR

The paper tackles the challenge of thoroughly exploring vast drug-like chemical space with Chemical Language Models (CLMs) trained via reinforcement learning (RL), where mode collapse can hinder exploration. It introduces MolExp, a benchmark that requires rediscovery of structurally diverse molecules with similar bioactivity, and demonstrates that scaling test-time training (TTT) by increasing the number of independent RL agents produces a log-linear gain in exploration efficiency. Cooperative multi-agent RL strategies show limited improvements for targeted exploration and often trade off diversity for performance. Together, these findings establish MolExp as a practical framework for scalable, exploration-focused AI-driven drug discovery and highlight population-based TTT as a promising path to comprehensive chemical-space exploration.
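To make the multi-target rediscovery idea concrete, the sketch below tracks, for each target, the best similarity reached by any molecule from any agent, along with the agent that produced it. It is a minimal illustration and not the MolExp scoring code: fingerprints are modelled as plain sets of integers, and the names `tanimoto` and `best_per_target` are hypothetical. How the benchmark aggregates these per-target values into a single score is not stated here, so the sketch stops short of that step.

```python
from typing import Dict, FrozenSet, List, Tuple

# Assumption: a fingerprint is a set of "on" bit indices. A real pipeline
# would compute these with a cheminformatics toolkit.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / union


def best_per_target(
    targets: Dict[str, Fingerprint],
    agent_samples: Dict[int, List[Fingerprint]],
) -> Dict[str, Tuple[float, int]]:
    """For each target, return (max similarity, agent id that achieved it).

    The agent id is -1 if no sampled molecule has similarity above zero.
    """
    best: Dict[str, Tuple[float, int]] = {}
    for name, target_fp in targets.items():
        best_sim, best_agent = 0.0, -1
        for agent_id, samples in agent_samples.items():
            for fp in samples:
                sim = tanimoto(target_fp, fp)
                if sim > best_sim:
                    best_sim, best_agent = sim, agent_id
        best[name] = (best_sim, best_agent)
    return best
```

With several independent agents, different agents can end up closest to different targets, which is the behaviour the benchmark is designed to reward.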

Abstract

Chemical Language Models (CLMs) leveraging reinforcement learning (RL) have shown promise in de novo molecular design, yet often suffer from mode collapse, limiting their exploration capabilities. Inspired by Test-Time Training (TTT) in large language models, we propose scaling TTT for CLMs to enhance chemical space exploration. We introduce MolExp, a novel benchmark emphasizing the discovery of structurally diverse molecules with similar bioactivity, simulating real-world drug design challenges. Our results demonstrate that scaling TTT by increasing the number of independent RL agents follows a log-linear scaling law, significantly improving exploration efficiency as measured by MolExp. In contrast, increasing TTT training time yields diminishing returns, even with exploration bonuses. We further evaluate cooperative RL strategies to enhance exploration efficiency. These findings provide a scalable framework for generative molecular design, offering insights into optimizing AI-driven drug discovery.
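A log-linear scaling law means the benchmark score grows roughly linearly in the logarithm of the number of agents, i.e. score ≈ a + b·log2(N). The sketch below shows one way such a trend could be fitted with ordinary least squares. The function name `fit_log_linear` is hypothetical, the choice of base 2 is an assumption made so that `b` reads as "gain per doubling of agents", and any numbers passed to it in examples are synthetic rather than results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(
    n_agents: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of score = intercept + slope * log2(n_agents).

    Returns (intercept, slope). With base-2 logs, the slope is the
    expected score gain each time the number of agents doubles.
    """
    if len(n_agents) != len(scores) or len(n_agents) < 2:
        raise ValueError("need at least two (n_agents, score) pairs")
    if any(n <= 0 for n in n_agents):
        raise ValueError("agent counts must be positive")

    xs = [math.log2(n) for n in n_agents]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(scores) / len(scores)

    sxx = sum((x - x_mean) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("agent counts must not all be identical")
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope
```

Under this model, going from 1 to 2 agents is expected to help as much as going from 64 to 128. Scaling the population therefore keeps paying off, but at an exponentially growing compute cost per unit of gain.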

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Schematic representation of the chemical space manifold according to molecular desirability, where energy wells are desirable sub-spaces of 'high-reward' areas according to a given objective. (Left) Typical objectives in the literature. (Right) Our proposed objective. The sphere represents a generative model learning to optimize desirability or maximize reward. Target rediscovery: the reward reflects maximal similarity to a target molecule in chemical space. Target similarity: the reward reflects similarity above a certain threshold to a target molecule in chemical space, perhaps with additional multi-parameter objectives. Predicted property: multiple regions of reward exist in chemical space. Lastly, our proposal, multiple target rediscovery: there exist several distinct regions of high-reward chemical space, and the goal is to rediscover all target molecules.
  • Figure 2: Baseline performance for each MolExpL task during RL training, single replicate. Each line color represents similarity to a target molecule. Note that the ACEGEN and REINVENT$_{MolOpt}$ methods outperform the other baselines due to their enhanced ability to optimize similarity to at least one target molecule. Interestingly, REINVENT$_{MolOpt}$ optimizes similarity to a different target molecule than ACEGEN for the EGFR task, despite identical initial policy parameterization.
  • Figure 3: Performance on the MolExpL benchmark with scaling. (a) Scaling the number of independent ACEGEN$_{MolOpt}$ and REINFORCE RL agents, each with a budget of 10,000. (b) Scaling the total budget allocated to a single agent, a single agent with an RND exploration bonus, and a single agent with a DF. (c) The diversity of sampled compounds as measured by sphere exclusion diversity.
  • Figure 4: Example molecules generated compared to the set of targets in the MolExpL A2A task. The maximum similarity score and corresponding agent $k$ is labeled. Note that only with 87 agents is the task solved.
  • Figure 5: Performance comparison of (a) MolExpL Score and (b) Molecular diversity of different 4-agent cooperative strategies on the MolExpL benchmark, each with a budget of 10,000. The dashed line represents the average of 4 independent agents as baseline.
  • ...and 1 more figures
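
Figures 3(c) and 5(b) report diversity as sphere exclusion diversity. One common formulation, sketched below, greedily picks molecules as sphere centers, excludes every molecule that lies within a distance threshold of an existing center, and reports the fraction of molecules that end up as centers. This is an illustrative sketch and not the paper's implementation: fingerprints are modelled as sets of integers, the function names are hypothetical, and the default Tanimoto-distance threshold of 0.65 is an assumption.

```python
from typing import FrozenSet, List, Sequence

# Assumption: a fingerprint is a set of "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch
    return len(a & b) / union


def sphere_exclusion_diversity(
    fps: Sequence[Fingerprint], threshold: float = 0.65
) -> float:
    """Fraction of molecules kept as sphere centers.

    A molecule becomes a new center only if its Tanimoto distance
    (1 - similarity) to every existing center is at least `threshold`.
    The result lies in (0, 1] for non-empty input; higher means the
    sample covers more distinct regions of chemical space.
    """
    if not fps:
        return 0.0
    centers: List[Fingerprint] = []
    for fp in fps:
        if all(1.0 - tanimoto(fp, c) >= threshold for c in centers):
            centers.append(fp)
    return len(centers) / len(fps)
```

Because the greedy pass depends on input order, a practical evaluation would fix or randomize the ordering consistently across the methods being compared.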