AgentCAT: An LLM Agent for Extracting and Analyzing Catalytic Reaction Data from Chemical Engineering Literature

Wei Yang; Zihao Liu; Tao Tan; Xiao Hu; Hong Xie; Lulu Li; Xin Li; Jianyu Han; Defu Lian; Mao Ye

TL;DR

AgentCAT is a large language model (LLM) agent that extracts and analyzes catalytic reaction data from chemical engineering papers. It also supports interactive, natural-language analysis of the extracted data, which makes the data accessible to the community.

Abstract

This paper presents AgentCAT, a large language model (LLM) agent that extracts and analyzes catalytic reaction data from chemical engineering papers. AgentCAT offers a way around the long-standing data bottleneck in the chemical engineering field, and its interactive, natural-language data analysis makes the extracted data accessible to the community. The paper also gives a formal abstraction and challenge analysis of the catalytic reaction data extraction task in a form suited to artificial intelligence research. This abstraction should help the artificial intelligence community understand the problem and, in turn, draw more attention to it. Technically, the complexity of catalytic processes gives catalytic reaction data a complicated dependency structure over elementary reaction steps, molecular behaviors, measurement evidence, and so on. This dependency structure makes it hard to guarantee that extraction is correct and complete, and hard to represent the extracted data for analysis. AgentCAT addresses this challenge with four technical contributions: (1) a schema-governed extraction pipeline with progressive schema evolution, enabling robust data extraction from chemical engineering papers; (2) a dependency-aware reaction-network knowledge graph that links catalysts/active sites, synthesis-derived descriptors, mechanistic claims with evidence, and macroscopic outcomes, preserving process coupling and traceability; (3) a general querying module that supports natural-language exploration and visualization over the constructed graph for cross-paper analysis; (4) an evaluation on approximately 800 peer-reviewed chemical engineering publications demonstrating the effectiveness of AgentCAT.
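To make contributions (1) and (2) more concrete, the following is a minimal Python sketch, not AgentCAT's actual implementation. It assumes that schema evolution can be approximated by adding previously unseen field names one paper at a time, and that the knowledge graph can be modeled as directed edges tagged with the source paper for traceability. The names `evolve_schema` and `ReactionGraph`, the relation labels, and the two sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def evolve_schema(schema, record):
    """Add field names not yet in `schema`; return the newly added names."""
    new_fields = sorted(set(record) - schema)
    schema.update(new_fields)
    return new_fields


class ReactionGraph:
    """Toy directed graph; every edge remembers the paper it came from."""

    def __init__(self):
        # source node -> list of (relation, target node, paper id)
        self.edges = defaultdict(list)

    def add(self, source, relation, target, paper_id):
        self.edges[source].append((relation, target, paper_id))

    def downstream(self, start):
        """Return all nodes reachable from `start` along dependency edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for _, target, _ in self.edges.get(node, []):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen


# Hypothetical extraction results, one record per paper (illustrative values).
records = [
    {"paper_id": "P1", "catalyst": "H-ZSM-5",
     "descriptor": "Si/Al ratio", "outcome": "olefin selectivity"},
    {"paper_id": "P2", "catalyst": "H-ZSM-5",
     "descriptor": "crystal size", "outcome": "catalyst lifetime",
     "evidence": "in situ IR"},
]

schema = set()
new_per_round = []  # number of new schema items per round (one paper each)
graph = ReactionGraph()
for rec in records:
    new_per_round.append(len(evolve_schema(schema, rec)))
    graph.add(rec["catalyst"], "has_descriptor", rec["descriptor"], rec["paper_id"])
    graph.add(rec["descriptor"], "influences", rec["outcome"], rec["paper_id"])

print(new_per_round)
print(sorted(graph.downstream("H-ZSM-5")))
```

Because each edge carries a paper identifier, a cross-paper query such as `graph.downstream("H-ZSM-5")` can collect outcomes reported in different publications while still allowing each link to be traced back to its source.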

Paper Structure (22 sections, 8 figures, 3 tables)

Introduction
Related Work
Preliminary & Challenges
Graph Abstraction of Catalytic Reaction Data in Chemical Engineering
Challenges for General LLMs
The Design of AgentCAT
Architectural Overview
Adaptive Information Extraction
Knowledge Graph Construction
General Querying and Graph Exploration
Performance Evaluation
Evaluation Settings
Performance of the Schema Evolution
Performance of Data Extracting
Performance of General Querying Agent
...and 7 more sections

Figures (8)

Figure 1: A graph abstraction of the backbone of the catalytic reaction data in chemical engineering.
Figure 2: A simplified SSP-oriented schematic of how microscopic mechanistic events governed by catalyst active sites and the pore environment propagate to macroscopic reaction outcomes in H-ZSM-5 catalyzed MTO.
Figure 3: Architectural overview of AgentCAT, showing the multi-phase process of converting raw PDF literature into structured data and constructing a Knowledge Graph through adaptive extraction, schema evolution, and dynamic graph exploration.
Figure 4: Number of newly introduced schema items per evolution round (each round processes one PDF).
Figure 5: Extraction quality and reliability of AgentCAT on a large catalysis corpus. (a) Distribution of extracted JSON length as a coarse proxy for information density. (b) Expert blind ratings of extracted outputs across Completeness/Accuracy/Readability. (c) Overall review-verdict distribution over all extracted sections.
...and 3 more figures
