Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

Xuan Li; Zhanke Zhou; Zongze Li; Jiangchao Yao; Yu Rong; Lu Zhang; Bo Han

TL;DR

Reference-guided Policy Optimization (RePO) learns from reference molecules without requiring trajectory data. It consistently outperforms SFT and RLVR baselines on the optimization metric (Success Rate $\times$ Similarity).

Abstract

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We show that answer-only SFT on the reference molecules collapses reasoning, while RLVR yields sparse feedback under similarity constraints because the model explores ineffectively, which slows learning and limits optimization. To encourage exploration of new molecules while still exploiting the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules together with their intermediate reasoning trajectories from the model and trains the model with reinforcement learning on verifiable rewards that measure property satisfaction under similarity constraints. Meanwhile, it applies reference guidance: the policy's own intermediate reasoning trajectory is kept as context, and only the answer is trained in a supervised manner toward the reference. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate $\times$ Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePO.
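
The two terms described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the function names, the binary reward, the similarity threshold of 0.4, the GRPO-style group normalization, and the `guidance_weight` parameter are all assumptions made for illustration, and the log-probabilities are plain numbers standing in for model outputs.

```python
from statistics import mean, pstdev


def verifiable_reward(prop_before, prop_after, similarity, sim_threshold=0.4):
    """Hypothetical binary reward: 1.0 only if the property improved AND the
    edited molecule stays similar enough to the input (threshold is assumed)."""
    improved = prop_after > prop_before
    return 1.0 if (improved and similarity >= sim_threshold) else 0.0


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages (assumed): standardize rewards within one group
    of candidates sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def repo_style_loss(sample_logps, rewards, ref_answer_logps, guidance_weight=1.0):
    """Combine an RL term with a reference-guidance term.

    sample_logps:     log-prob of each sampled (reasoning + answer) completion
    rewards:          verifiable reward of each sampled completion
    ref_answer_logps: log-prob of the REFERENCE answer given the prompt and the
                      policy's own sampled reasoning (only the answer is scored)
    """
    advantages = group_advantages(rewards)
    # Policy-gradient surrogate: raise log-prob of above-average samples.
    rl_term = -mean([a * lp for a, lp in zip(advantages, sample_logps)])
    # Supervised term: negative log-likelihood of the reference answer.
    guidance_term = -mean(ref_answer_logps)
    total = rl_term + guidance_weight * guidance_term
    return total, rl_term, guidance_term


if __name__ == "__main__":
    # Sparse-reward case: no sampled candidate satisfies the constraints.
    sample_logps = [-12.0, -9.5, -11.0, -10.0]
    rewards = [0.0, 0.0, 0.0, 0.0]
    ref_answer_logps = [-3.0, -2.5, -4.0, -3.5]
    total, rl_term, guidance_term = repo_style_loss(
        sample_logps, rewards, ref_answer_logps
    )
    print("RL term:", rl_term)
    print("guidance term:", guidance_term)
    print("total:", total)
```

In this toy setting, when every sampled candidate receives the same reward, the group-normalized advantages are all zero and the RL term carries no learning signal, while the guidance term still does. That is one way to read the abstract's claim that reference guidance mitigates reward sparsity.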

Paper Structure (39 sections, 4 equations, 17 figures, 11 tables)

Introduction
Preliminaries
Supervision Mismatch Under Competing Objectives
RePO: Reference-guided Policy Optimization
Experiments
Experiment Settings
Quantitative Results
Mechanism and Robustness Analyses
Case Studies
Conclusion
Ethic Statement
Reproduction Statement
LLM Usage Disclosure
Impact Statement
Limitations
...and 24 more sections

Figures (17)

Figure 1: Molecular optimization modifies components of a given molecule to improve it while preserving structural similarity to the original molecule. The molecule is represented as a SMILES string (Weininger, 1988), a sequence of symbols for atoms and bonds.
Figure 2: Illustration of RePO. The model generates answers via reasoning; reference guidance anchors to the reference conditioned on the reasoning context, while RLVR optimizes the property under similarity constraints.
Figure 3: Performance comparison on molecular optimization tasks. Details of the molecular properties can be found in the appendix on metrics.
Figure 4: Average success rate and similarity for SFT, GRPO, GRPO (SFT-init), and RePO on property optimization. GRPO achieves high similarity but a low success rate. RePO improves success while maintaining high similarity.
Figure 5: Training dynamics of completion lengths of different methods. GRPO (SFT-init) generates short responses that match the SFT data.
...and 12 more figures

Theorems & Definitions (1)

Remark 4.1
