LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature

Magdalena Lederbauer, Siddharth Betala, Xiyao Li, Ayush Jain, Amine Sehaba, Georgia Channing, Grégoire Germain, Anamaria Leonescu, Faris Flaifil, Alfonso Amayuelas, Alexandre Nozadze, Stefan P. Schmid, Mohd Zaki, Sudheesh Kumar Ethirajan, Elton Pan, Mathilde Franckel, Alexandre Duval, N. M. Anoop Krishnan, Samuel P. Gleason

TL;DR

This work tackles the fragmentation of inorganic synthesis knowledge by introducing LeMat-Synth, a multi-modal framework that uses LLMs and VLMs to automatically extract and structure synthesis procedures and performance data from a large corpus of open literature. An ontology of 35 synthesis methods and 16 material classes underpins a scalable pipeline that merges text and figure analysis, producing a machine-readable dataset (LeMat-Synth v1.0) from 81k papers and enabling data-driven synthesis planning and synthesis–structure–property modeling. The authors validate extraction quality with expert annotations and a scalable LLM-as-a-judge framework, while releasing an open-source software stack to extend the dataset to new domains. Although open-access bias and extraction limitations remain, this infrastructure establishes a foundation for predictive materials science and autonomous discovery workflows.

Abstract

The development of synthesis procedures remains a fundamental challenge in materials discovery, with procedural knowledge scattered across decades of scientific literature in unstructured formats that resist systematic analysis. In this paper, we propose a multi-modal toolbox that employs large language models (LLMs) and vision language models (VLMs) to automatically extract and organize synthesis procedures and performance data from both the text and the figures of materials science publications. We curated 81k open-access papers, yielding LeMat-Synth (v1.0): a dataset of synthesis procedures spanning 35 synthesis methods and 16 material classes, structured according to an ontology specific to materials science. The extraction quality is evaluated on a subset of 2.5k synthesis procedures through a combination of expert annotations and a scalable LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source software library designed to support community-driven extension to new corpora and synthesis domains. Altogether, this work provides an extensible infrastructure to transform unstructured literature into machine-readable information. This lays the groundwork for predictive modeling of synthesis procedures as well as modeling of synthesis–structure–property relationships.
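As a rough illustration of what "machine-readable" could mean here, the sketch below models one structured synthesis record. The class name, field names, and example values are all hypothetical: they are loosely based on the quantities this summary mentions (material class, synthesis method, starting materials, action steps) and are not the schema or ontology released with LeMat-Synth.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SynthesisRecord:
    """Hypothetical record layout; not the actual LeMat-Synth schema."""

    material: str            # free-text identifier (not standardized)
    material_class: str      # assumed to be one of the ontology's material classes
    synthesis_method: str    # assumed to be one of the ontology's synthesis methods
    starting_materials: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)  # ordered action steps


# Illustrative values only; not taken from the dataset.
record = SynthesisRecord(
    material="example oxide",
    material_class="example material class",
    synthesis_method="example synthesis method",
    starting_materials=["precursor A", "precursor B"],
    actions=["mix", "heat", "cool"],
)

# Serialize to JSON so the record can be stored or exchanged.
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Keeping the record a flat dataclass makes it easy to serialize and to validate field by field; a real schema would likely need richer types for conditions such as temperatures and durations.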

Paper Structure

This paper contains 41 sections, 6 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Overview of the pipeline presented in this work. We fetch data from a corpus of over 2 million open-access papers from arXiv, ChemRxiv, and Semantic Scholar, filtered down to 81k papers in materials science. When a synthesis procedure is identified, we extract both textual and visual content. We then parse materials and their synthesis procedures using a structured LLM pipeline. Figures are segmented, classified, and digitized using computer vision models and VLMs. The resulting structured records are evaluated, validated, and assembled into a standardized, extensible synthesis database.
  • Figure 2: Distribution of extraction scores across (a) the 10 most common synthesis methods and (b) material categories of the evaluation set of 2.5k synthesis procedures. The categories are ordered according to the mean. Vertical lines represent the 25th, 50th, and 75th percentiles, respectively. For a complete set of statistics across all material and synthesis categories, see \ref{table:llm_syn_scores-synthesis-type} and \ref{table:llm_syn_scores-material-type} in \ref{app:sec:synth-extr-eval-llm-human}.
  • Figure 3: Evaluation of the figure extraction pipeline. (a) Original figure from a source publication (mateo2024challenges); (b) plot reconstructed from manually digitized data; (c) plot reconstructed from data automatically extracted by our pipeline. The close visual alignment and low error metrics indicate that the automated figure parsing is accurate.
  • Figure 4: Statistics of the dataset evaluated in this work. (a) Distribution of action steps and (b) the 15 most common actions. (c) Distribution of the number of starting materials and (d) the 10 most common starting materials. Note that, similarly to material identifiers, starting materials are not standardized.
  • Figure 5: Synthesis procedures and methods for the evaluation set, colored according to the source of the underlying publication (arXiv, ChemRxiv, OMG24).
  • ...and 8 more figures
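
The Figure 2 caption describes per-category score distributions ordered by mean, with the 25th, 50th, and 75th percentiles marked. A minimal sketch of that summary is given below, using only the Python standard library. The function name, the input layout, the descending sort order, and the "inclusive" quantile method are assumptions made for illustration; the sample scores are invented and are not results from the paper.

```python
from statistics import mean, quantiles


def summarize_scores(scores_by_category):
    """Return per-category mean and quartiles, ordered by mean (descending).

    `scores_by_category` maps a category name to a list of at least two
    numeric scores. Both the layout and the ordering are assumptions.
    """
    summary = {}
    for category, scores in scores_by_category.items():
        # n=4 yields three cut points: the 25th, 50th, and 75th percentiles.
        p25, p50, p75 = quantiles(scores, n=4, method="inclusive")
        summary[category] = {
            "mean": mean(scores),
            "p25": p25,
            "p50": p50,
            "p75": p75,
        }
    # Order categories by their mean score, highest first.
    ordered = sorted(summary.items(), key=lambda item: item[1]["mean"], reverse=True)
    return dict(ordered)


# Invented example scores for two placeholder categories.
example = {
    "category A": [1, 2, 3, 4, 5],
    "category B": [2, 4, 6, 8, 10],
}
result = summarize_scores(example)
for name, stats in result.items():
    print(name, stats)
```

Different quantile conventions give slightly different cut points on small samples, so the method should match whatever plotting tool produced the figure.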