Automated Review Generation Method Based on Large Language Models

Shican Wu; Xiao Ma; Dehui Luo; Lulu Li; Xiangcheng Shi; Xin Chang; Xiaoyun Lin; Ran Luo; Chunlei Pei; Changying Du; Zhi-Jian Zhao; Jinlong Gong

TL;DR

The study tackles the problem of exploding scientific literature by introducing an end-to-end automated review generation method based on large language models (LLMs). It combines literature retrieval, topic formulation, knowledge extraction, and review composition within a modular pipeline, augmented by a dual-baseline evaluation framework and multi-layer hallucination mitigation. A propane dehydrogenation (PDH) catalyst case study demonstrates cross-disciplinary applicability, processing 343 articles (and 1041 in extended analysis) across 35 topics, with near-manual quality and robust citation tracing; hallucination probability is reduced to below 0.5% with 95% confidence. An open-source Windows GUI enables one-click review generation, and data mining insights into catalyst design are provided, highlighting broad potential for automated literature analysis across domains and for accelerating scientific knowledge dissemination.

Abstract

Literature research, vital for scientific work, faces the challenge of surging information volumes exceeding researchers' processing capabilities. We present an automated review generation method based on large language models (LLMs) to overcome efficiency bottlenecks and reduce cognitive load. Our statistically validated evaluation framework demonstrates that the generated reviews match or exceed manual quality, offering broad applicability across research fields without requiring users' domain knowledge. Applied to propane dehydrogenation (PDH) catalysts, our method swiftly analyzed 343 articles, averaging seconds per article per LLM account, producing comprehensive reviews spanning 35 topics, with extended analysis of 1041 articles providing insights into catalysts' properties. Through multi-layered quality control, we effectively mitigated LLMs' hallucinations, with expert verification confirming accuracy and citation integrity while demonstrating hallucination risks reduced to below 0.5% with 95% confidence. The released Windows application enables one-click review generation, enhancing research productivity and literature recommendation efficiency while setting the stage for broader scientific explorations.
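The claim "below 0.5% with 95% confidence" can be read as a one-sided upper bound on an error rate estimated from manually checked samples. The abstract does not say how the bound was computed, so the following is only a minimal sketch under an assumption: that no hallucinations were found in a random sample and that the standard zero-failure binomial bound applies. The function names are hypothetical and are not taken from the paper or its software.

```python
import math


def zero_failure_upper_bound(n_samples: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the failure rate when 0 failures are
    observed in n_samples independent checks (assumed setting)."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Solve (1 - p)^n = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)


def samples_needed(max_rate: float, confidence: float = 0.95) -> int:
    """Smallest number of failure-free checks needed so that the upper
    bound on the failure rate does not exceed max_rate."""
    if not 0.0 < max_rate < 1.0:
        raise ValueError("max_rate must be in (0, 1)")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_rate))


if __name__ == "__main__":
    n = samples_needed(0.005)
    print(n, zero_failure_upper_bound(n))
```

Under this assumed model, roughly 600 failure-free checks are needed to support a bound of 0.5% at 95% confidence; if any hallucinations had been observed, a different interval (for example Clopper-Pearson) would be required.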

Paper Structure (21 sections, 7 figures, 1 table)

This paper contains 21 sections, 7 figures, 1 table.

INTRODUCTION
RESULTS AND DISCUSSION
Automated retrieval
Implementation and analysis of one-click automated review generation
Evaluation of generated review quality
Data mining and visual analysis
Hallucination mitigation
CONCLUSIONS
METHODS
Literature search
Topic formulation
Knowledge extraction
Review composition
Data mining
Quality Assessment
...and 6 more sections

Figures (7)

Figure 1: Reliability verification results of the dual-baseline review quality assess-
Figure 2: Quality assessment results of automatically generated reviews. Heat maps show the percentage difference between the scores of review paragraphs generated by this method and human scores, on a red-to-green scale from -100% to +100%; higher values indicate better performance, and values outside this range are clipped to -100% or +100%: a, Highest scoring paragraph of Claude3.5 Sonnet model; b, Highest scoring paragraph of Qwen2-72b-Instruct model; c, Average paragraph score of Claude3.5 Sonnet model; d, Average paragraph score of Qwen2-72b-Instruct model. e, Histogram of percentage differences in scores relative to human scores for highest scoring paragraphs, average paragraph scores, and directly generated paragraph scores without going through this method for Claude3.5 Sonnet model and Qwen2-72b-Instruct model, colors ranging from dark to light representing Claude3.5
Figure 3: Example of visual analysis results. Line charts for annual publication numbers: a, Different catalyst types; b, Performance enhancement sources. Radar charts for peak performance of single factors, with selectivity (black) and stability (purple) scales: c, Promoter elements; d, Support materials. Bubble charts for dual-variable correlations show selectivity (color depth), conversion rate (bubble size), and stability (bubble edge thickness), with the aim of identifying combinations that achieve high selectivity, conversion rate, and stability. Data include only entries with selectivity $\geq$85%, conversion rate $\geq$45%, stability $\geq$1h: e, Active site element-composition element; f, Alloy structure type-preparation method. Complete data charts are available in the SI.
Figure 4: Effectiveness of hallucination mitigation. a, Consistency as determined by LLMs between direct LLM responses and aggregated results during the knowledge extraction phase, where blue represents 100% consistency and orange less than 100%. b, Distribution of manual sampling results for direct LLM responses and aggregated outcomes during the data mining phase, with TP (True Positive), TN (True Negative), FP (False Positive), FN (False Negative)
Figure 5: a, Flowchart of the automated review generation method based on large language models. It includes four modules: i) literature search, ii) topic formulation, iii) knowledge extraction, iv) review composition, as well as an additional data mining module. b, Flowchart of the quality assessment framework for review generation based on large language models.
...and 2 more figures
