A Multi-Agent Human-LLM Collaborative Framework for Closed-Loop Scientific Literature Summarization

Maxwell J. Jacobson, Daniel Xie, Jackson Shen, Adil Wazeer, Haiyan Wang, Xinghang Zhang, Yexiang Xue

Abstract

Scientific discovery is slowed by fragmented literature that requires excessive human effort to gather, analyze, and understand. AI tools, including autonomous summarization and question answering, have been developed to aid in understanding this literature. However, these tools lack the structured, multi-step approach necessary for extracting deep insights from it. Large Language Models (LLMs) offer new possibilities for literature analysis, but remain unreliable due to hallucinations and incomplete extraction. We introduce Elhuyar, a multi-agent, human-in-the-loop system that integrates LLMs, structured AI, and human scientists to extract, analyze, and iteratively refine insights from scientific literature. The framework distributes tasks among specialized agents for filtering papers, extracting data, fitting models, and summarizing findings, with human oversight ensuring reliability. The system generates structured reports with extracted data, visualizations, model equations, and text summaries, enabling deeper inquiry through iterative refinement. Deployed in materials science, it analyzed literature on tungsten under helium-ion irradiation, showing an experimentally correlated exponential growth of helium bubbles with irradiation dose and temperature and offering insight for plasma-facing materials (PFMs) in fusion reactors. This demonstrates how AI-assisted literature review can uncover scientific patterns and accelerate discovery.

Paper Structure

This paper contains 30 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: AI has long assisted in understanding scientific literature. However, existing methods fall short in complex summarization tasks that require a long chain of reasoning. Our Elhuyar Framework integrates LLMs, structured AI, and human scientists in a closed-loop system that extracts, analyzes, and refines data from scientific literature iteratively, ensuring reliability while leading to deep understanding. When deployed in materials science, Elhuyar revealed nonlinear helium bubble growth in irradiated tungsten, demonstrating how AI-assisted literature review can uncover new scientific knowledge from literature.
  • Figure 2: Pipeline of the Elhuyar system. It involves two human roles: the scientist, who provides literature, asks a scientific query, and defines relevant data, and the inspector, who verifies low-confidence extracted data. The extraction path begins when the scientist queries the system, triggering a yes/no filtering agent that selects relevant documents, which are then processed by an extractor agent to extract data points multiple times. Iterative consensus scoring calculates confidence to reduce hallucinations and misreads, ultimately producing a filtered dataset. In the modeling path, a model selection agent chooses models to compare, and a model fit+eval agent fits the filtered data to these models, generating equations and evaluations. The response path consists of a single response agent that compiles a report with visualizations, equations, and a text summary answering the scientist's query. The final results allow the scientist to refine their question or expand the analysis with adjustments.
  • Figure 3: A Transmission Electron Microscope (TEM) image of tungsten after helium ion irradiation [Iwakiri2000]. The smaller, circular white spaces are helium bubbles. The larger, amorphous ones are another microstructure formed by irradiation (dislocation loops). Both grow larger at higher irradiation temperatures, and both are pervasive throughout the material. Characterizing these microstructural changes helps explain the material's macroscopic properties in applications like PFMs in fusion reactors.
  • Figure 4: Comparison of two models fitted by the system for predicting helium bubble size in tungsten under helium ion irradiation. (A) shows a linear model (assumes helium bubble size follows an additive relationship with irradiation temperature and dose). (B) shows an exponential model (bubble size grows with an exponential function of temperature and dose). The exponential model achieves a higher $R^2$ value, indicating a closer fit to the extracted data points. See Table \ref{tab:model_comparison} for detailed equations and fit metrics.
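
The Figure 2 caption says the extractor agent reads each data point multiple times and that consensus scoring turns the repeated reads into a confidence value, with low-confidence points sent to the inspector. The caption does not give the scoring rule, so the following is only a minimal sketch under one assumed rule: confidence is the share of runs that agree with the most common value. The function name `consensus`, the 0.6 threshold, and the sample values are all hypothetical.

```python
from collections import Counter


def consensus(values, threshold=0.6):
    """Score repeated extractions of one data point.

    Assumed rule (not taken from the paper): confidence is the fraction
    of runs that agree with the most common extracted value.
    Returns (modal_value, confidence, needs_inspection).
    """
    counts = Counter(values)
    value, votes = counts.most_common(1)[0]
    confidence = votes / len(values)
    # Points below the threshold would be routed to the human inspector.
    return value, confidence, confidence < threshold


# Hypothetical repeated extractions of one bubble-size value from one paper.
runs = [2.1, 2.1, 2.1, 1.2, 2.1]
value, confidence, needs_inspection = consensus(runs)
print(value, confidence, needs_inspection)
```

Exact-match voting is the simplest choice; a real system handling noisy numeric reads might instead group values within a tolerance before counting.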
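
The model comparison described in the Figure 4 caption can be illustrated with a short sketch. It assumes a linear form `size = a + b*T + c*D` and an exponential form `size = A * exp(b*T + c*D)`, fits the latter by least squares on the logarithm of the size, and compares both by $R^2$ on the original scale. These functional forms, the fitting procedure, and the function names are assumptions for illustration; the paper's actual equations are the ones in its model comparison table. The data here are synthetic, generated from a known exponential law, and are not the extracted literature data.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination on the original (untransformed) scale."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_linear(T, D, y):
    """Least-squares fit of y = a + b*T + c*D; returns predictions."""
    X = np.column_stack([np.ones_like(T), T, D])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ coef


def fit_exponential(T, D, y):
    """Fit y = A*exp(b*T + c*D) via least squares on log(y); returns predictions."""
    X = np.column_stack([np.ones_like(T), T, D])
    coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    return np.exp(X @ coef)


# Synthetic data (arbitrary units) drawn from a known exponential law
# with multiplicative noise. These are NOT values from the paper.
rng = np.random.default_rng(0)
T = rng.uniform(300.0, 1500.0, size=40)   # temperature
D = rng.uniform(0.1, 10.0, size=40)       # dose
y = 0.5 * np.exp(0.002 * T + 0.1 * D) * rng.lognormal(0.0, 0.05, size=40)

r2_lin = r_squared(y, fit_linear(T, D, y))
r2_exp = r_squared(y, fit_exponential(T, D, y))
print(f"linear R^2 = {r2_lin:.3f}, exponential R^2 = {r2_exp:.3f}")
```

Because the synthetic data follow an exponential law by construction, the exponential fit scores higher here; on real extracted data the ranking is an empirical question, which is why the system fits and evaluates both.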