SimulRAG: Simulator-based RAG for Grounding LLMs in Long-form Scientific QA

Haozhou Xu, Dongxia Wu, Matteo Chinazzi, Ruijia Niu, Rose Yu, Yi-An Ma

TL;DR

SimulRAG is a simulator-based retrieval-augmented generation (RAG) framework that grounds LLMs in long-form scientific QA to reduce hallucination. It provides a generalized simulator retrieval interface that transforms between textual and numerical modalities, and a claim-level generator that decomposes answers into atomic claims that can be verified against simulator outputs. Uncertainty estimation and simulator boundary assessment (UE+SBA) select which claims to verify and update, which improves efficiency. On a benchmark covering climate science and epidemiology, with ground truth verified by both simulations and human annotators, SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality, and UE+SBA further improves the efficiency and quality of claim-level generation. The work treats scientific simulators as retrieval sources and contributes a benchmark for evaluating long-form scientific QA.

Abstract

Large language models (LLMs) show promise in solving scientific problems. They can help generate long-form answers to scientific questions, which are crucial for a comprehensive understanding of complex phenomena that require detailed explanations spanning multiple interconnected concepts and evidence. However, LLMs often suffer from hallucination, especially in the challenging task of long-form scientific question answering. Retrieval-Augmented Generation (RAG) approaches can ground LLMs by incorporating external knowledge sources to improve trustworthiness. In this context, scientific simulators, which play a vital role in validating hypotheses, offer a particularly promising retrieval source to mitigate hallucination and enhance answer factuality. However, existing RAG approaches cannot be directly applied to scientific simulation-based retrieval due to two fundamental challenges: how to retrieve from scientific simulators, and how to efficiently verify and update long-form answers. To overcome these challenges, we propose the simulator-based RAG framework (SimulRAG) and provide a long-form scientific QA benchmark covering climate science and epidemiology with ground truth verified by both simulations and human annotators. In this framework, we propose a generalized simulator retrieval interface to transform between textual and numerical modalities. We further design a claim-level generation method that utilizes uncertainty estimation scores and simulator boundary assessment (UE+SBA) to efficiently verify and update claims. Extensive experiments demonstrate that SimulRAG outperforms traditional RAG baselines by 30.4% in informativeness and 16.3% in factuality. UE+SBA further improves efficiency and quality for claim-level generation.
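
To make the idea of transforming between textual and numerical modalities concrete, here is a minimal sketch of the three-step retrieval flow that the Figure 1 caption outlines (extract parameters, run the simulator, render outputs as text). It is not the authors' implementation. Everything named here is an assumption for illustration: the toy SIR model stands in for a real scientific simulator, `SIRParams`, `extract_params`, `run_sir`, `to_context`, and `retrieve` are hypothetical names, and a regular expression replaces the LLM prompt that the paper uses for parameter extraction.

```python
import re
from dataclasses import dataclass


@dataclass
class SIRParams:
    """Hypothetical parameter set for a toy SIR simulator."""
    beta: float   # transmission rate per day
    gamma: float  # recovery rate per day
    days: int     # simulation horizon in days


def extract_params(question: str) -> SIRParams:
    """Step 1 stand-in: the paper prompts an LLM with the question and a
    simulator handbook; a regex is used here only so the sketch runs offline."""
    def find(name: str, default: float) -> float:
        match = re.search(rf"{name}\s*=\s*([0-9]*\.?[0-9]+)", question)
        return float(match.group(1)) if match else default

    return SIRParams(find("beta", 0.3), find("gamma", 0.1), int(find("days", 100)))


def run_sir(p: SIRParams, population: float = 1000.0, infected0: float = 1.0) -> dict:
    """Step 2: execute the simulator (forward-Euler SIR with a one-day step)."""
    s, i, r = population - infected0, infected0, 0.0
    peak_i, peak_day = i, 0
    for day in range(1, p.days + 1):
        new_inf = p.beta * s * i / population
        new_rec = p.gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        if i > peak_i:
            peak_i, peak_day = i, day
    return {"peak_infected": peak_i, "peak_day": peak_day, "final_recovered": r}


def to_context(p: SIRParams, out: dict) -> str:
    """Step 3: convert numerical outputs to text with a predefined template."""
    return (
        f"SIR simulation with beta={p.beta}, gamma={p.gamma} over {p.days} days: "
        f"infections peak at about {out['peak_infected']:.0f} people on day {out['peak_day']}; "
        f"about {out['final_recovered']:.0f} people have recovered by the end."
    )


def retrieve(question: str) -> str:
    """Text in, text out: the numerical simulation happens in between."""
    params = extract_params(question)
    return to_context(params, run_sir(params))


print(retrieve("With beta=0.4 and gamma=0.2 over days=50, when does the outbreak peak?"))
```

The returned string plays the role of a retrieved passage in ordinary RAG: it can be placed in the prompt as context, even though it was produced by running a model rather than by searching a corpus.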

Paper Structure

This paper contains 33 sections, 4 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Left: Overall SimulRAG structure, including the simulator-based retriever and claim-level generator. Right: Simulator retrieval interface: (1) prompting LLM with question and handbook to extract simulator parameter settings; (2) executing simulator with parameters to obtain simulation outputs; (3) converting outputs to textual context via predefined or LLM-generated templates.
  • Figure 2: Claim-level generation process: (1) decompose long-form answers into atomic claims; (2) apply uncertainty estimation and simulator boundary assessment to select claims for verification; (3) update selected claims using simulation context; (4) integrate verified claims into a coherent answer.
  • Figure 3: Example questions and answers from our benchmark dataset for climate and epidemiology.
  • Figure 4: Factuality and informativeness comparison across RAG methods. SimulRAG consistently outperforms baseline methods on both evaluation metrics, showing superior capability for generating informative and factual long-form scientific answers.
  • Figure 5: Left: Proportion of SBA-selected claims verifiable by simulator. Right: Performance comparison between Uncertainty and UE+SBA across five uncertainty estimation scores on climate science and epidemiology benchmarks.
  • ...and 3 more figures
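
The four steps in the Figure 2 caption (decompose, select, update, integrate) can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions, not the paper's method: the function names are hypothetical, sentence splitting stands in for atomic-claim decomposition, the uncertainty scorer, boundary check, and updater are caller-supplied callables, and combining UE and SBA with a logical AND plus a fixed threshold is an assumption, since the text here does not specify the combination rule.

```python
import re
from typing import Callable, List


def decompose(answer: str) -> List[str]:
    """Step 1 stand-in: split on sentence boundaries. A real system would
    need a stronger decomposition into atomic claims."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def select_claims(
    claims: List[str],
    uncertainty: Callable[[str], float],
    in_boundary: Callable[[str], bool],
    threshold: float = 0.5,
) -> List[int]:
    """Step 2: indices of claims that are uncertain (UE) and that the
    simulator can address (SBA). The AND rule and threshold are assumptions."""
    return [
        k for k, claim in enumerate(claims)
        if uncertainty(claim) >= threshold and in_boundary(claim)
    ]


def claim_level_generate(
    answer: str,
    context: str,
    uncertainty: Callable[[str], float],
    in_boundary: Callable[[str], bool],
    update: Callable[[str, str], str],
    threshold: float = 0.5,
) -> str:
    claims = decompose(answer)
    chosen = set(select_claims(claims, uncertainty, in_boundary, threshold))
    # Step 3: only selected claims are rewritten against the simulation context.
    revised = [update(c, context) if k in chosen else c for k, c in enumerate(claims)]
    # Step 4 stand-in: plain concatenation instead of a fluent rewrite.
    return " ".join(revised)


# Toy stand-ins so the sketch runs end to end; none of these come from the paper.
demo = claim_level_generate(
    answer="Infections peak on day 10. Vaccines require cold storage.",
    context="simulated peak on day 36",
    uncertainty=lambda c: 0.9 if "day" in c else 0.2,
    in_boundary=lambda c: "peak" in c,
    update=lambda c, ctx: f"Per simulation ({ctx}), the original claim needs revision: {c}",
)
print(demo)
```

Selecting before updating is what saves work: claims that are already confident, or that fall outside what the simulator can model, pass through unchanged and never trigger a verification call.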