Training a Scientific Reasoning Model for Chemistry

Siddharth M. Narayanan, James D. Braza, Ryan-Rhys Griffiths, Albert Bou, Geemi Wellawatte, Mayk Caldas Ramos, Ludovico Mitchener, Samuel G. Rodriques, Andrew D. White

TL;DR

The paper demonstrates that a reasoning language model can be trained for chemistry through reinforcement learning and distillation, achieving strong performance on open-ended molecular design tasks without additional domain pretraining and with substantially less data than contemporary domain-specific models. By combining long chain-of-thought reasoning, task-specific RL, and a generalist distillation stage, ether0 outperforms domain-specific and frontier models while remaining data efficient. The approach uses verifiable rewards, problem rewriting, and curriculum strategies to promote robust reasoning and reduce failure modes, with safety-aligned RL as a final step. These findings suggest a scalable path to data-efficient reasoning models in other scientific domains, and the authors release resources openly to enable replication and further development.

Abstract

Reasoning models are large language models that emit a long chain-of-thought before answering, providing both higher accuracy and explicit reasoning for their response. A major question has been whether language model reasoning generalizes beyond mathematics, programming, and logic, where most previous work has focused. We demonstrate that reasoning models can be post-trained for chemistry without additional domain pretraining, and require substantially less data compared to contemporary domain-specific models. We report ether0, a 24B parameter LLM (based on Mistral-Small-24B) that can reason in natural language and respond with chemical structures. This reasoning model was trained with reinforcement learning on 640,730 experimentally-grounded chemistry problems across 375 tasks ranging from synthesizability, to blood-brain barrier permeability, to human receptor activity, to scent. Our model exceeds general-purpose chemistry models, frontier models, and human experts on molecular design tasks. It is also more data efficient relative to specialized models. We anticipate that this method can be applied to train data-efficient language models specialized for tasks across a wide variety of scientific domains.

Paper Structure

This paper contains 48 sections, 7 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: An overview of the training methodology and an example reasoning trace for ether0. Training stages are shown in the bottom panel, where the accuracy per step is scaled to have the same x-axis range (see \ref{sec:hypers}).
  • Figure 2: Per-task performance of our model compared to general-purpose LLMs. For multiple choice tasks, the "random" line accounts for varying numbers of options between problems. The human bar is an average of four chemists equipped with only the molecule drawing tool ChemDraw. Humans were not evaluated on receptor binding and scent tasks, as the structure-property relationship is mostly unknown, making these tasks essentially impossible without additional tools.
  • Figure 3: Data efficiency analysis. (A) Comparison of ether0 to Molecular Transformer (MT) on reaction prediction: ether0 outperforms the published MT (dashed line) and shows higher data efficiency compared to retraining MT from scratch on our dataset († = retrained). (B) Effect of in-context learning (ICL) on multiple-choice questions (MCQs).
  • Figure 4: Annotated reasoning trace of the model correctly solving an unseen structure elucidation task, where o3, r1, Gemini 2.5-pro 05-07-25, and GPT-4.5 fail. The trace illustrates exploration, backtracking, and verification. The model does not know the real molecule name (azaleatin), referring to it as quercetin-C to indicate quercetin with an extra methyl group. Overall, this trace highlights both the strengths and limitations of ether0's learned capabilities in complex, multi-step chemical tasks.
  • Figure 5: Left: Per-task performance of reasoning and non-reasoning models. Right: Evolution of model reasoning behaviors on the evaluation set throughout training, across three problem categories: functional group, reaction prediction, and SMILES completion. We track 4 reasoning behaviors: backtracking, backward chaining, subgoal setting, and verification, alongside completion length.
  • ...and 7 more figures