Table of Contents

Reducing Hallucinations in LLM-based Scientific Literature Analysis Using Peer Context Outlier Detection

Daniel Xie, Maxwell J. Jacobson, Adil Wazeer, Haiyan Wang, Xinghang Zhang, Yexiang Xue

Abstract

Reducing hallucinations in Large Language Models (LLMs) is essential for improving the accuracy of data extraction from large text corpora. Current methods, such as prompt engineering and chain-of-thought prompting, focus on individual documents but fail to consider relationships across a corpus. This paper introduces Peer Context Outlier Detection (P-COD), a novel approach that uses the relationships between documents to improve extraction accuracy. Our application domain is scientific literature summarization, where papers with similar experimental settings should draw similar conclusions. By comparing extracted data to validated peer information within the corpus, we adjust confidence scores and flag low-confidence results for expert review. High-confidence results, supported by peer validation, are considered reliable. Our experiments achieve up to 98% precision in outlier detection across six domains of science, showing that our design reduces hallucinations, enhances trust in automated systems, and allows researchers to focus on ambiguous cases, streamlining data extraction workflows.

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: High-level idea behind Peer Context Outlier Detection. Each point represents a scientific paper, with color indicating an experimental result value. Papers that are semantically similar -- meaning they share similar research methodologies -- are positioned closely together in this vector space. The black flag marks an outlier, a data point flagged due to a significant deviation in experimental results compared to its semantically similar peers. This discrepancy, highlighted by color deviation, suggests a potential inconsistency in the LLM extraction. Our method identifies and flags surprising data points for human review, enhancing trust and accuracy in automated scientific data extraction.
  • Figure 2: This figure illustrates the workflow of our Peer Context Outlier Detection system for LLM-based scientific data extraction. The process begins with an extraction LLM retrieving relevant data from scientific literature based on user-defined instructions. Simultaneously, a text embedding LLM generates semantic text vectors, which place papers into a shared space where their proximity reflects their similarity -- closer papers discuss similar research methods and experiments. Extracted data is then compared within its peer group, where a surprising score is assigned based on deviations in experimental results. If a paper produces a result that is significantly different from its closely related peers, it is flagged as a potential anomaly for human review. The system visualizes these flagged points with color, making outliers with significant color deviation easy to identify and ensuring that only the most uncertain or inconsistent extractions are prioritized for human verification -- improving the reliability of automated scientific data extraction.
  • Figure 3: Multi-Domain Validation Results Across Six Scientific Fields. Each subplot shows semantic clustering and outlier detection for one domain with domain-specific experimental metrics: Computer Science (Model Accuracy, 70-95%), Physics (Optical Wavelength, 400-800 nm), Biology (Protein Concentration, 2-1000 ng/mL), Chemistry (Reaction Yield, 50-100%), Materials Science (Tensile Strength, 100-2000 MPa), and Environmental Science (Atmospheric CO2, 380-450 ppm). Points represent individual papers positioned by semantic similarity, with colors indicating experimental values according to the scale bars. Black flags mark the synthetic corrupted papers identified by P-COD as peer group outliers due to anomalous experimental values relative to their semantically similar neighbors. The clustering is often quite tight with significant overlap between similar papers, making individual outliers difficult to distinguish visually, but P-COD successfully flags papers that exceed the deviation threshold for human review. The "color-pop" effect makes outliers immediately visible across diverse scientific domains, demonstrating P-COD's effectiveness with realistic experimental ranges and domain-appropriate metrics.
  • Figure 4: Large-Scale Single-Domain Clustering Results across 8 Computer Science Sub-Fields. While all 200 papers belong to Computer Science, they span diverse sub-fields (Deep Learning Optimization, Reinforcement Learning, Graph Neural Networks, Blockchain Technology, Data Mining, Cloud Computing, Information Retrieval, and Bioinformatics) that share a unified accuracy metric (0.70-0.95). The colorbar represents model performance accuracy across all CS domains, from neural network accuracy to blockchain consensus success rates. The increased cluster separation (factor=3.0) enhances visualization of subtle semantic differences within the CS domain. Due to the high number of outliers detected by P-COD in this dataset, black flags mark only a representative sample of papers identified as peer outliers, though many more anomalous papers were successfully detected. The tight clustering demonstrates that CS papers are semantically closer than cross-domain papers, while P-COD maintains precision in detecting experimental anomalies even within closely related research areas.
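The captions above describe the core P-COD loop: embed each paper into a semantic vector space, form a peer group of nearest neighbors, assign a surprising score based on how far the extracted value deviates from its peers, and flag papers that exceed a threshold. The paper's exact scoring formula is not given here, so the following is a minimal sketch under assumed choices: cosine-similarity k-nearest-neighbor peer groups and a z-score-style surprising score (the function names, `k`, and `threshold` are illustrative, not from the paper).

```python
import numpy as np

def surprising_scores(embeddings, values, k=5):
    """For each paper, score how much its extracted value deviates from
    its k semantically nearest peers (z-score-style; assumed formulation).
    embeddings: (n, d) semantic text vectors; values: (n,) extracted results."""
    # Normalize rows so dot products are cosine similarities.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)  # a paper is not its own peer
    scores = np.empty(len(values))
    for i in range(len(values)):
        peers = np.argsort(sims[i])[-k:]  # indices of k nearest neighbors
        mu, sigma = values[peers].mean(), values[peers].std()
        scores[i] = abs(values[i] - mu) / (sigma + 1e-8)
    return scores

def flag_outliers(embeddings, values, k=5, threshold=3.0):
    """Return indices of papers whose value is surprising relative to peers,
    i.e. candidates for human review."""
    return np.where(surprising_scores(embeddings, values, k) > threshold)[0]

# Toy corpus: 20 semantically similar papers reporting accuracy ~0.85,
# with one corrupted extraction standing in for an LLM hallucination.
rng = np.random.default_rng(0)
emb = np.ones((20, 8)) + 0.1 * rng.normal(size=(20, 8))  # one tight cluster
vals = 0.85 + rng.normal(0.0, 0.01, 20)
vals[3] = 0.20  # hallucinated value, far from its peers
flagged = flag_outliers(emb, vals)
print(flagged)  # paper 3 is flagged for review
```

Real embeddings would come from a text embedding model rather than random vectors, and a robust peer statistic (e.g. median/MAD instead of mean/std) may be preferable when peer groups themselves contain corrupted extractions.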