LitXBench: A Benchmark for Extracting Experiments from Scientific Literature

Curtis Chong, Jorge Colindres

Abstract

Aggregating experimental data from papers enables materials scientists to build better property prediction models and accelerates scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than the processing steps that define a material.
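To illustrate the abstract's point about Python objects versus CSV/JSON, the following is a minimal sketch of how an object-based benchmark entry can enforce validation programmatically. All class, field, and method names here are illustrative assumptions, not the actual LitXBench schema.

```python
from dataclasses import dataclass, field

@dataclass
class Measurement:
    # "property_id" is assumed to be a canonical identifier,
    # disambiguating terms that differ across papers.
    property_id: str
    value: float
    unit: str

@dataclass
class BenchmarkEntry:
    paper_doi: str
    material_id: str
    measurements: list = field(default_factory=list)

    def validate(self) -> list:
        """Programmatic checks that a plain text format cannot enforce."""
        errors = []
        for m in self.measurements:
            if not m.unit:
                errors.append(f"{m.property_id}: missing unit")
            if m.value != m.value:  # NaN never equals itself
                errors.append(f"{m.property_id}: value is NaN")
        return errors

entry = BenchmarkEntry(
    paper_doi="10.0000/example",
    material_id="sample-A",
    measurements=[Measurement("yield_strength", 512.0, "MPa")],
)
assert entry.validate() == []
```

Because entries are code rather than text, malformed records fail loudly at load time instead of silently corrupting downstream aggregation.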

Paper Structure

This paper contains 14 sections, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Pareto front of experiment extraction methods.
  • Figure 2: LitXBench Principles for Accurate Extraction and Benchmarking. (1) To accurately capture a material's properties, measurements must be linked to its processing lineage, rather than just its composition. (2) Categorical values should be mapped to canonical identifiers to disambiguate similar values, as multiple papers may reference different properties with the same term. (3) Extracted materials are more editable and auditable when represented as code, reducing errors in the benchmark.
  • Figure 3: Schema of extracted materials in LitXAlloy. Each material is identified by its process steps, which are outlined by the arrow notation. Measurements performed on the material follow. CompMeasurements are various composition measurements performed on the sample. Configuration measurements correspond to microstructure and other features typically visible through an electron microscope. Further schema specification details are in the appendix on data models.
  • Figure 4: Helper function that calculates a composition whose addition weights are expressed relative to the weight of the base alloy. Because the code is self-documenting, this preserves the interpretability of the composition reported in the paper when compared to its nominal composition of $\mathrm{CoCrFeNiW}_{0.12}\mathrm{C}_{0.12}$.
  • Figure 5: Definition of each Synthesis Group. Each material defines which group of synthesis events it undergoes through the arrow notation group1$\rightarrow$group2. Groups that accept parameters (such as Hours) enable annotators to reuse synthesis groups across materials that differ by slight experimental parameters.
  • ...and 1 more figure