oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning

Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji

TL;DR

The paper introduces oMeBench, a rigorous, expert-curated benchmark for organic mechanism reasoning, paired with the dynamic oMeS evaluation framework to quantify mechanistic fidelity beyond product prediction. By organizing data into Gold, Template, and Silver sets, and by leveraging a robust alignment-based scoring scheme, the authors reveal that current LLMs struggle to sustain multi-step causal reasoning and chemical consistency, despite surface-level plausibility. Through exemplar-based prompting and supervised fine-tuning on the Silver dataset, they demonstrate substantial performance gains, including up to a 50% improvement over leading proprietary models. The work provides a foundational benchmark and insights that guide future development of chemically grounded reasoning in AI systems for reaction mechanism elucidation.

Abstract

Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in understanding chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for organic mechanism reasoning. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using a prompting strategy and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.
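The abstract describes oMeS only at a high level: predicted steps are scored against a reference using step-level logic and chemical similarity. The sketch below is one way such a score could be assembled, and it is not the paper's actual formula. The `Step` class, the subtype labels, the per-step weights, and `bigram_similarity` (a crude string-based stand-in for a fingerprint-based molecular similarity) are all assumptions made for illustration. Alignment here is an order-preserving dynamic program that lets either sequence skip steps.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Step:
    """One mechanistic step (hypothetical schema, not the dataset's format)."""
    subtype: str         # step type label, e.g. "proton_transfer" (assumed name)
    intermediate: str    # SMILES string of the species formed by this step
    weight: float = 1.0  # per-step weight; only the reference weights are used


def bigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of character bigrams of two SMILES strings.

    Placeholder only: a real evaluator would compare molecules, not strings.
    """
    if a == b:
        return 1.0
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def pair_score(pred: Step, gold: Step) -> float:
    """Credit for matching one predicted step to one reference step."""
    if pred.subtype != gold.subtype:
        return 0.0  # assumed rule: a wrong step type earns no credit
    return gold.weight * bigram_similarity(pred.intermediate, gold.intermediate)


def aligned_score(pred: Sequence[Step], gold: Sequence[Step]) -> float:
    """Best order-preserving alignment score, normalised to [0, 1]."""
    total_weight = sum(g.weight for g in gold)
    if total_weight == 0:
        return 0.0
    n, m = len(pred), len(gold)
    # best[i][j] = best score aligning the first i predicted steps
    # with the first j reference steps; skipped steps contribute 0.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],  # skip a predicted step
                best[i][j - 1],  # skip a reference step
                best[i - 1][j - 1] + pair_score(pred[i - 1], gold[j - 1]),
            )
    return best[n][m] / total_weight
```

Under these assumptions, a prediction identical to the reference scores 1.0, and a prediction that gets the first of two equally weighted steps right but mislabels the second scores 0.5. Extra predicted steps are not penalised in this sketch; a stricter variant could subtract a gap cost.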

Paper Structure

This paper contains 75 sections, 12 equations, 9 figures, 17 tables, 1 algorithm.

Figures (9)

  • Figure 1: Given a reaction from oMeBench, we generate a reaction mechanism and score it with oMeS.
  • Figure 2: Overview of dataset construction and examples. A named reaction refers to a class of reactions that share a common mechanistic pattern and can be abstracted into a generalized template. Templates are denoted using placeholders such as “R-groups,” which represent variable substituents (e.g., Me, Et, H).
  • Figure 3: Overview of the dataset metadata and format.
  • Figure 4: The overview of the evaluation system. LLM-generated mechanisms are dynamically aligned with gold references. Subtype correctness, molecular similarity $\sigma$, and weight $W$ are used to compute oMeS $S$, $V$, and $L$.
  • Figure 5: Performance of LLMs on oMeBench across difficulty levels. Frontier models outperform others, but all degrade on harder reactions, while chemistry-specific models lag despite domain pretraining.
  • ...and 4 more figures