LLM-Driven Multi-Agent Curation and Expansion of Metal-Organic Frameworks Database

Honghui Kim, Dohoon Kim, Jihan Kim

TL;DR

This work tackles the pervasive issue of structural errors in MOF databases by introducing LitMOF, an LLM-driven multi-agent framework that retrieves information from primary literature and existing databases to detect and repair MOF CIFs. The system's plan-and-execute architecture orchestrates five specialized agents to construct reference graphs, validate CIF structures, and apply corrections, yielding LitMOF-DB, a set of 118,464 computation-ready MOFs curated from an initial 128,799 CSD entries. It repairs thousands of entries (including 6,161 CoRE MOFs) and uncovers 12,646 missing MOFs reported in the literature, thereby expanding the experimental design space. The approach demonstrates a scalable, self-correcting pathway for materials data curation with potential generalization to other materials databases and curation tasks.

Abstract

Metal-organic framework (MOF) databases have grown rapidly through experimental deposition and large-scale literature extraction, but recent analyses show that nearly half of their entries contain substantial structural errors. These inaccuracies propagate through high-throughput screening and machine-learning workflows, limiting the reliability of data-driven MOF discovery. Correcting such errors is exceptionally difficult because true repairs require integrating crystallographic files, synthesis descriptions, and contextual evidence scattered across the literature. Here we introduce LitMOF, a large language model-driven multi-agent framework that validates crystallographic information directly from the original literature and cross-validates it with database entries to repair structural errors. Applying LitMOF to the experimental MOF database (the CSD MOF Subset), we constructed LitMOF-DB, a curated set of 118,464 computation-ready structures, including corrections of 69% (6,161 MOFs) of the invalid MOFs in the latest CoRE MOF database. Additionally, the system uncovered 12,646 experimentally reported MOFs absent from existing resources, substantially expanding the known experimental design space. This work establishes a scalable pathway toward self-correcting scientific databases and a generalizable paradigm for LLM-driven curation in materials science.

Paper Structure

This paper contains 10 sections, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Schematic illustration of how the LitMOF multi-agent system interacts with a user and generates responses. LitMOF consists of a Supervisor and five specialized agents; the Supervisor interprets the user query and dispatches tasks to the appropriate agents. For the PICLAS example, LitMOF retrieves database records, extracts information from the publication, constructs a reference graph, and corrects structural errors in the CIF. LitMOF can also execute follow-up tasks, such as DFT geometry optimization, via the Simulation Runner.
  • Figure 2: a, Unified agent template comprising an LLM-driven head module and a set of nodes, each representing either another agent call or an LLM/tool operation. b, Decision process of the head module, which interprets the query, generates or updates a plan, selects the next node, and determines termination. c, Structure of an agent plan, represented as an overall goal and an ordered list of nodes with associated descriptions and execution statuses. d, Hierarchical plan-and-execute behaviour illustrated using the PICLAS correction workflow, where the Supervisor’s high-level plan expands into finer-grained plans executed by specialized agents.
  • Figure 3: Example of a missing MOF case (refcode: TEQLIM). Missing MOFs refer to structures that were synthesized and characterized in the literature but were not deposited as CIF files in the CSD. This example contains two such missing MOFs. For each missing MOF, LitMOF identifies the parent MOF (the most structurally similar MOF available in the CSD), the transformation required to obtain the missing MOF from its parent, and the reason the CIF is missing when explicitly provided in the paper.
  • Figure 4: Three types of error correction handled by the Inspector & Editor agent. a, Bond errors are corrected by adjusting the distance threshold used to determine bond formation, which adds or removes bonds as needed. b, Hydrogen errors are corrected using two complementary methods, identity mapping and graph matching. c, Disorder correction resolves duplicated or entangled components into chemically meaningful configurations through graph matching and MLIP-based energy evaluation.
  • Figure 5: a, Results of the database construction using the LitMOF agent. Starting from the CSD MOF Subset containing 128,799 structures, we corrected 25,721 MOFs and constructed a curated database of 118,464 experimental MOFs with free solvent removed. During this process, we also identified 12,646 missing MOFs and compiled a separate missing-MOF database. b, Comparison between our curated MOF database and the latest CoRE MOF database [zhao_2025_15055758]. c, The four most common transformations that relate a parent MOF to its corresponding missing MOF.
  • ...and 3 more figures
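
The plan structure described in the Figure 2c caption (an overall goal plus an ordered list of nodes, each with a description and an execution status) can be illustrated with a small data structure. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: the names `Plan`, `PlanStep`, `Status`, and `run_plan`, as well as the four node names, are hypothetical, and a deterministic loop stands in for the LLM-driven head module, which according to the caption of Figure 2b also generates and updates the plan. The stub nodes only loosely mirror the PICLAS steps listed for Figure 1.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class Status(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"


@dataclass
class PlanStep:
    node: str          # hypothetical node name (an agent call or an LLM/tool operation)
    description: str   # human-readable description of what the node should do
    status: Status = Status.PENDING


@dataclass
class Plan:
    goal: str
    steps: List[PlanStep] = field(default_factory=list)

    def next_step(self) -> Optional[PlanStep]:
        """Return the first step that has not been executed yet, if any."""
        for step in self.steps:
            if step.status is Status.PENDING:
                return step
        return None


Node = Callable[[dict], dict]


def run_plan(plan: Plan, nodes: Dict[str, Node], state: dict) -> dict:
    """Run pending steps in order; stop when none remain or one fails.

    Assumption: a real head module would ask an LLM to pick the next node
    and possibly revise the plan; here the order is fixed.
    """
    while (step := plan.next_step()) is not None:
        try:
            state = nodes[step.node](state)
            step.status = Status.DONE
        except Exception:
            step.status = Status.FAILED
            break
    return state


# Stub nodes (all hypothetical) that only record that they ran.
NODES: Dict[str, Node] = {
    "retrieve_records": lambda s: {**s, "records": True},
    "extract_from_paper": lambda s: {**s, "paper_info": True},
    "build_reference_graph": lambda s: {**s, "reference_graph": True},
    "correct_cif": lambda s: {**s, "cif_corrected": True},
}

plan = Plan(
    goal="Correct structural errors in the CIF for refcode PICLAS",
    steps=[
        PlanStep("retrieve_records", "Fetch database records for the refcode"),
        PlanStep("extract_from_paper", "Extract information from the publication"),
        PlanStep("build_reference_graph", "Construct the reference graph"),
        PlanStep("correct_cif", "Apply corrections to the CIF"),
    ],
)
final_state = run_plan(plan, NODES, {"refcode": "PICLAS"})
```

Keeping the status on each step lets a supervisor inspect a partially executed plan and resume or revise it. The hierarchical behaviour in Figure 2d could be modelled in this sketch by letting a node itself call `run_plan` on a finer-grained plan.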