Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs

Lecheng Kong, Xiyuan Wang, Yixin Chen, Muhan Zhang

TL;DR

The paper tackles the challenge of round-trip consistency (RTC) in chemical LLMs by introducing Round-Trip Reinforcement Learning (RTRL), which uses a backward model's success at reconstructing inputs as a reward signal to train the forward model. The framework supports an iterative self-improvement loop where forward and backward mappings are alternately trained, even with unpaired data, enabling data-efficient training in chemistry where labeled pairs are scarce. Empirically, RTRL yields substantial gains in RTC and primary task performance across supervised, self-supervised, and synthetic regimes, with RTC improvements up to 52% and primary-task gains up to 55%. This approach demonstrates that RTC is a trainable objective that can produce more robust, credible chemical foundation models and open new paths for bidirectional reasoning in scientific AI.

Abstract

Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry, handling bidirectional tasks like reaction prediction and retrosynthesis. However, these models often lack round-trip consistency. For instance, a state-of-the-art chemical LLM may successfully caption a molecule, yet be unable to accurately reconstruct the original structure from its own generated text. This inconsistency suggests that models are learning unidirectional memorization rather than flexible mastery. Indeed, recent work has demonstrated a strong correlation between a model's round-trip consistency and its performance on the primary tasks. This strong correlation reframes consistency into a direct target for model improvement. We therefore introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency by using the success of a round-trip transformation as a reward signal. We further propose an iterative variant where forward and reverse mappings alternately train each other in a self-improvement loop, a process that is highly data-efficient and notably effective with the massive amount of unlabelled data common in chemistry. Experiments demonstrate that RTRL significantly boosts performance and consistency over strong baselines across supervised, self-supervised, and synthetic data regimes. This work shows that round-trip consistency is not just a desirable property but a trainable objective, offering a new path toward more robust and reliable foundation models.
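
As a rough formalization of the reward described above, one way to write the round-trip objective is sketched below. The notation is an assumption introduced here for illustration and is not taken from the paper: $\pi_\theta$ is the model being trained, $p_{\text{bwd}}$ is whatever model scores the reconstruction, $q_{\text{fwd}}$ and $q_{\text{bwd}}$ are the forward and backward prompts, and $\mathcal{D}$ is a set of inputs.

$$
r(x, y) = p_{\text{bwd}}\left(x \mid y, q_{\text{bwd}}\right), \qquad y \sim \pi_\theta\left(\cdot \mid x, q_{\text{fwd}}\right)
$$

$$
J(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x, q_{\text{fwd}})} \left[ r(x, y) \right]
$$

Under this reading, the reward depends only on the input $x$ and the model's own output $y$, so no reference output is needed. That is consistent with the claim that the method can use unpaired, unlabelled data. The specific RL algorithm and any reward normalization are not stated in this summary and are left unspecified here.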

Paper Structure

This paper contains 16 sections, 13 equations, 2 figures, 12 tables.

Figures (2)

  • Figure 1: Left: Round-Trip Reinforcement Learning pipeline. An input $x$ is mapped to an output $y$ by a forward prompt. We then compute the generation likelihood of $x$ given $y$ and a backward prompt. The likelihood is used as the reward for RL. Right: In iterative RTRL, we switch the forward and backward prompts as well as the input and output domains to achieve mutual improvement.
  • Figure 2: Model performance progression in iterative RTRL. From left to right: (1) retrosynthesis, supervised; (2) reaction prediction, supervised; (3) CHEBI-20, synthetic; (4) CHEBI-20, synthetic. Iteration 0 denotes the base model.
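
The pipeline in the Figure 1 caption can be illustrated with a minimal sketch: score how likely the original input $x$ is given the generated output $y$ and a backward prompt, and use that likelihood as the reward. Everything named below is hypothetical and chosen for illustration only: the `Scorer` signature, the `toy_scorer` stand-in for a backward LLM, the prompt string, and the length normalization are assumptions, not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature (an assumption, not the paper's API):
# (backward_prompt, generated_output_y, tokens_of_x) -> per-token log-probs of x.
Scorer = Callable[[str, str, Sequence[str]], List[float]]


def round_trip_reward(x_tokens: Sequence[str], y: str,
                      backward_prompt: str, score: Scorer) -> float:
    """Likelihood of reconstructing x from y, used as the RL reward.

    Length normalization (geometric mean of token probabilities) is an
    assumption made here to keep rewards comparable across input lengths.
    """
    log_probs = score(backward_prompt, y, x_tokens)
    if not log_probs:
        return 0.0
    return math.exp(sum(log_probs) / len(log_probs))


def toy_scorer(backward_prompt: str, y: str,
               x_tokens: Sequence[str]) -> List[float]:
    """Toy stand-in for a backward LLM: a token of x is likely (0.9) if it
    literally appears in y, and unlikely (0.1) otherwise."""
    seen = set(y.split())
    return [math.log(0.9 if tok in seen else 0.1) for tok in x_tokens]


def iteration_direction(i: int, domains: Tuple[str, str]) -> Tuple[str, str]:
    """Iterative variant: swap input and output domains every iteration."""
    a, b = domains
    return (a, b) if i % 2 == 0 else (b, a)


x = ["C", "C", "O"]  # toy tokenized input
faithful = round_trip_reward(x, "C C O ethanol", "Recover the molecule:", toy_scorer)
lossy = round_trip_reward(x, "an organic compound", "Recover the molecule:", toy_scorer)
```

In this toy setup an output that preserves the information in $x$ earns a higher reward than one that discards it, which is the signal the forward direction would be trained on. `iteration_direction` mirrors the right-hand panel of Figure 1 by alternating which domain serves as input and which as output.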