Amortized Molecular Optimization via Group Relative Policy Optimization

Muhammad bin Javaid, Hasham Hussain, Ashima Khanna, Berke Kisin, Jonathan Pirnay, Alexander Mitsos, Dominik G. Grimm, Martin Grohe

TL;DR

This work tackles the scalability gap between instance optimization and amortized molecular design by addressing high variance across starting structures with Group Relative Policy Optimization (GRPO). The authors introduce GRXForm, a Graph Transformer policy that constructs molecules through atom-and-bond additions under structural constraints, amortized across tasks. GRPO normalizes rewards within groups of trajectories per starting structure, stabilizing learning and enabling fast, generalizable optimization without inference-time oracle calls. Empirical results across kinase scaffold decoration, prodrug transfer, and PMO benchmarks show GRXForm is competitive with top instance optimizers and substantially more efficient at scale. The approach promises practical impact for high-throughput molecular design, offering a scalable alternative to iterative search while preserving multi-objective performance.

Abstract

Molecular design encompasses tasks ranging from de-novo design to structural alteration of given molecules or fragments. For the latter, state-of-the-art methods predominantly function as "Instance Optimizers", expending significant compute restarting the search for every input structure. While model-based approaches theoretically offer amortized efficiency by learning a policy transferable to unseen structures, existing methods struggle to generalize. We identify a key failure mode: the high variance arising from the heterogeneous difficulty of distinct starting structures. To address this, we introduce GRXForm, adapting a pre-trained Graph Transformer model that optimizes molecules via sequential atom-and-bond additions. We employ Group Relative Policy Optimization (GRPO) for goal-directed fine-tuning to mitigate variance by normalizing rewards relative to the starting structure. Empirically, GRXForm generalizes to out-of-distribution molecular scaffolds without inference-time oracle calls or refinement, achieving scores in multi-objective optimization competitive with leading instance optimizers.
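
As a rough illustration of what "normalizing rewards relative to the starting structure" can mean, the block below writes out a group-relative advantage in the commonly used GRPO form. The symbols $G$ (group size), $r_{i,j}$ (reward of the $j$-th trajectory sampled from starting structure $S_i$), and $\epsilon$ (a small stabilizing constant) are notation assumed here rather than taken from the paper. The division by the group standard deviation follows the usual GRPO formulation; the paper's exact normalization may differ.

```latex
% Sketch under assumed notation: G trajectories are sampled per starting structure S_i.
\mu_i = \frac{1}{G} \sum_{j=1}^{G} r_{i,j},
\qquad
\sigma_i = \sqrt{\frac{1}{G} \sum_{j=1}^{G} \left( r_{i,j} - \mu_i \right)^2},
\qquad
A_{i,j} = \frac{r_{i,j} - \mu_i}{\sigma_i + \epsilon}
```

Under this form, each trajectory is scored only against other trajectories that started from the same structure, so an intrinsically hard $S_i$ with uniformly low rewards still yields both positive and negative advantages.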

Paper Structure

This paper contains 50 sections, 8 equations, 5 figures, 9 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Comparison of optimization paradigms. (A) Instance Optimization: Requires an expensive iterative search with thousands of oracle calls for every new input structure $S_i$, resulting in high cost that scales linearly with library size. (B) Amortized Optimization: Front-loads computation into offline training. The learned policy $\pi_\theta$ generates optimized molecules for new, unseen inputs $S_i$ in a single forward pass without inference-time oracle calls, enabling scalable, high-throughput design.
  • Figure 2: Overview of the GRXForm policy architecture and training mechanism.
  • Figure 3: Comparison of advantage estimation strategies. (A) Global Baseline (REINFORCE) fails to account for heterogeneous scaffold difficulty, leading to biased gradients. (B) Group-Relative Baseline (GRPO) uses instance-specific group means ($\mu_A, \mu_B$) to normalize rewards, stabilizing the learning signal across both easy and hard tasks.
  • Figure 4: Advantage Signal Stability. Comparison of mean advantage during training. The global baseline (REINFORCE, pink) exhibits high-magnitude variance due to heterogeneous scaffold difficulty, destabilizing the gradient. In contrast, GRPO (green) mitigates this via instance-specific normalization, yielding a stable learning signal.
  • Figure 5: Structural Generalization Split. t-SNE visualization of the chemical space (Morgan Fingerprints) for Training, Validation, and Test scaffolds. The cluster-based splitting strategy ensures that Test scaffolds (purple) occupy distinct regions of chemical space compared to the Training set (blue) and Validation set (orange), enforcing a test of out-of-distribution generalization.
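
The contrast described in the Figure 3 and Figure 4 captions can be sketched with a toy example. The snippet below is a minimal illustration, not the authors' implementation: the scaffold names, reward values, the helper function names, and the division by the group standard deviation are all assumptions made for demonstration.

```python
from statistics import mean, pstdev


def global_baseline_advantages(groups):
    """REINFORCE-style advantage: subtract one mean taken over all trajectories."""
    baseline = mean(r for rewards in groups.values() for r in rewards)
    return {name: [r - baseline for r in rewards] for name, rewards in groups.items()}


def group_relative_advantages(groups, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its own scaffold group."""
    advantages = {}
    for name, rewards in groups.items():
        mu = mean(rewards)        # instance-specific group mean
        sigma = pstdev(rewards)   # population standard deviation of the group
        advantages[name] = [(r - mu) / (sigma + eps) for r in rewards]
    return advantages


# Hypothetical rewards for four trajectories per starting scaffold.
groups = {
    "A": [0.80, 0.90, 0.70, 0.80],  # "easy" scaffold: uniformly high rewards
    "B": [0.10, 0.20, 0.05, 0.25],  # "hard" scaffold: uniformly low rewards
}

adv_global = global_baseline_advantages(groups)
adv_group = group_relative_advantages(groups)

for name in groups:
    print(name, "global:", [round(a, 3) for a in adv_global[name]])
    print(name, "group: ", [round(a, 3) for a in adv_group[name]])
```

With the global baseline, every trajectory from the easy scaffold receives a positive advantage and every trajectory from the hard scaffold a negative one, regardless of how good each trajectory is relative to its siblings. The group-relative version centers each scaffold at zero, so the best trajectory from the hard scaffold is still reinforced.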