Benchmarking Agentic Systems in Automated Scientific Information Extraction with ChemX

Anastasia Vepreva, Julia Razlivina, Maria Eremeeva, Nina Gubina, Anastasia Orlova, Aleksei Dmitrenko, Ksenya Kapranova, Susan Jyakhwo, Nikita Vasilev, Arsen Sarkisyan, Ivan Yu. Chernyshov, Vladimir Vinogradov, Andrei Dmitrenko

TL;DR

ChemX addresses the challenge of automated chemical information extraction by introducing a rigorous, multimodal benchmark of 10 curated datasets across nanomaterials and small molecules. It systematically evaluates state-of-the-art agentic systems and baselines, and introduces a reproducible single-agent preprocessing pipeline to improve extraction quality. The study finds persistent gaps in extracting domain-specific terminology and complex structures (e.g., SMILES) and shows that structured preprocessing can boost recall, while general agentic approaches struggle to generalize beyond specific datasets. By providing a diverse, expert-validated resource and a clear evaluation protocol, ChemX lays the groundwork for advancing automated information extraction in chemistry and improving cross-domain generalization.

Abstract

The emergence of agent-based systems represents a significant advancement in artificial intelligence, with growing applications in automated data extraction. However, chemical information extraction remains a formidable challenge due to the inherent heterogeneity of chemical data. Current agent-based approaches, both general-purpose and domain-specific, exhibit limited performance in this domain. To address this gap, we present ChemX, a comprehensive collection of 10 manually curated and domain-expert-validated datasets focusing on nanomaterials and small molecules. These datasets are designed to rigorously evaluate and enhance automated extraction methodologies in chemistry. To demonstrate their utility, we conduct an extensive benchmarking study comparing existing state-of-the-art agentic systems such as ChatGPT Agent and chemical-specific data extraction agents. Additionally, we introduce our own single-agent approach that enables precise control over document preprocessing prior to extraction. We further evaluate the performance of modern baselines, such as GPT-5 and GPT-5 Thinking, to compare their capabilities with agentic approaches. Our empirical findings reveal persistent challenges in chemical information extraction, particularly in processing domain-specific terminology, complex tabular and schematic representations, and context-dependent ambiguities. The ChemX benchmark serves as a critical resource for advancing automated information extraction in chemistry, challenging the generalization capabilities of existing methods, and providing valuable insights into effective evaluation strategies.

Paper Structure

This paper contains 19 sections, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: ChemX. The pipeline comprises manual collection of multimodal data from scientific articles, validation by domain experts, and benchmarking of automated data extraction.
  • Figure 2: Quality control process for ChemX datasets
  • Figure 3: Quality control process for ChemX datasets