MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Revision

Yuyang Wu, Jinhui Ye, Shuhao Zhang, Lu Dai, Yonatan Bisk, Olexandr Isayev

TL;DR

MolErr2Fix addresses the need for trustworthy chemical reasoning in LLMs with a four-stage benchmark (error detection, localization, explanation, and revision) for fine-grained error handling in molecular descriptions. It assembles 1,193 expert-annotated errors across 525 molecules and provides a comprehensive evaluation protocol with multiple metrics, including Precision, Recall, $F_1$, IoU, BLEU, and GPT-Score. Baseline experiments show that while some models do well at detection, localizing, explaining, and especially revising chemical errors remain challenging, highlighting the gap between fluent text and chemically valid reasoning. The work advocates chemistry-centric pretraining, self-reflection loops for iterative debugging, and broader benchmark coverage to push toward more reliable, chemically informed LLMs.
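As a rough illustration of how metrics of this kind are commonly computed, the sketch below implements set-based Precision, Recall, and $F_1$ together with a character-span IoU. It is a minimal sketch under stated assumptions, not the benchmark's evaluation code: the function names, the use of half-open character offsets, and the placeholder labels `"type_a"`, `"type_b"`, `"type_c"` are all hypothetical, and the paper's exact metric definitions may differ.

```python
from __future__ import annotations


def precision_recall_f1(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 over predicted vs. gold labels."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def span_iou(pred: tuple[int, int], gold: tuple[int, int]) -> float:
    """Intersection over union of two half-open character spans (start, end)."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Placeholder labels (assumed for illustration, not the benchmark's taxonomy).
p, r, f1 = precision_recall_f1({"type_a", "type_b"}, {"type_b", "type_c"})
print(p, r, f1)                      # 0.5 0.5 0.5
print(span_iou((0, 10), (5, 15)))    # overlap of 5 characters over a union of 15
```

Spans that only touch at a boundary, such as `(0, 5)` and `(5, 10)`, score 0 under the half-open convention assumed here.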

Abstract

Large Language Models (LLMs) have shown growing potential in molecular sciences, but they often produce chemically inaccurate descriptions and struggle to recognize or justify potential errors. This raises important concerns about their robustness and reliability in scientific applications. To support more rigorous evaluation of LLMs in chemical reasoning, we present the MolErr2Fix benchmark, designed to assess LLMs on error detection and correction in molecular descriptions. Unlike existing benchmarks focused on molecule-to-text generation or property prediction, MolErr2Fix emphasizes fine-grained chemical understanding. It tasks LLMs with identifying, localizing, explaining, and revising potential structural and semantic errors in molecular descriptions. Specifically, MolErr2Fix consists of 1,193 fine-grained annotated error instances. Each instance carries a quadruple annotation, i.e., (error type, span location, explanation, correction). These tasks are intended to reflect the types of reasoning and verification required in real-world chemical communication. Evaluations of current state-of-the-art LLMs reveal notable performance gaps, underscoring the need for more robust chemical reasoning capabilities. MolErr2Fix provides a focused benchmark for evaluating such capabilities and aims to support progress toward more reliable and chemically informed language models. All annotations and an accompanying evaluation API will be publicly released to facilitate future research.
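To make the quadruple annotation concrete, the sketch below models one annotated error as a small record and applies its correction to the description. Everything here is an assumption for illustration: the `ErrorAnnotation` class, its field names, the half-open character offsets, the `"placeholder_type"` label, and the ethanol sentence are invented and are not taken from the released dataset or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorAnnotation:
    """One annotated error: (error type, span location, explanation, correction)."""
    error_type: str    # hypothetical label; the benchmark defines its own taxonomy
    start: int         # span start, character offset (inclusive)
    end: int           # span end, character offset (exclusive)
    explanation: str
    correction: str


def apply_corrections(description: str, annotations: list[ErrorAnnotation]) -> str:
    """Replace each annotated span with its correction.

    Spans are assumed not to overlap. They are applied right to left so that
    earlier offsets stay valid after each replacement.
    """
    revised = description
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        revised = revised[:ann.start] + ann.correction + revised[ann.end:]
    return revised


# Invented example sentence, not a dataset entry.
description = "Ethanol is a secondary alcohol."
start = description.index("secondary")
annotation = ErrorAnnotation(
    error_type="placeholder_type",
    start=start,
    end=start + len("secondary"),
    explanation="The hydroxyl-bearing carbon is bonded to only one other carbon.",
    correction="primary",
)
print(apply_corrections(description, [annotation]))  # Ethanol is a primary alcohol.
```

Storing the span as offsets rather than as a substring keeps localization unambiguous when the same phrase occurs more than once in a description.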

Paper Structure

This paper contains 45 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: (a) and (b) show that molecular captions generated by LLMs contain many errors despite high BLEU and ROUGE scores against the ground truth. (c) shows that LLMs fail to detect errors.
  • Figure 2: Annotation pipeline of MolErr2Fix. (a) Generation of problematic candidate molecular captions using standardized prompts across multiple LLMs with ChEBI-20 SMILES. (b) The expert annotation process involves four steps (error localization, classification, explanation, and correction) based on expert-defined taxonomies and reference tools, ensuring chemical accuracy in molecular descriptions.
  • Figure 3: Error distribution of six chemical error types in the outputs of five advanced LLMs.
  • Figure 4: Error detection performance of GPT-4o across six chemical error types in the MolErr2Fix benchmark.
  • Figure 5: Localization performance of GPT-4o across six chemical error types in the MolErr2Fix benchmark.
  • ...and 2 more figures