Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study

Alessandro Stolfo

TL;DR

The paper investigates groundedness in retrieval-augmented long-form QA, examining whether each generated sentence relies on retrieved documents or model pre-training data. It introduces a grounding-verification setup that separately assesses grounding to retrieved sources and pre-training corpora across three LFQA datasets and four model families, using EM-based correctness and a TRUE-based grounding model. The findings show that a substantial portion of sentences in $\mathrm{EM}^+$ (at least partially correct) generations remain ungrounded, even among large models, though grounding improves with model size, instruction tuning, and beam-search decoding. The work highlights persistent hallucination risks in LFQA and emphasizes the need for more robust grounding mechanisms and decoding strategies to reliably tether long-form answers to credible sources, with practical implications for safe deployment and evaluation of retrieval-augmented LLMs.

Abstract

We present an empirical study of groundedness in long-form question answering (LFQA) by retrieval-augmented large language models (LLMs). In particular, we evaluate whether every generated sentence is grounded in the retrieved documents or the model's pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded, even when those sentences contain correct ground-truth answers. Additionally, we examine the impacts of factors such as model size, decoding strategy, and instruction tuning on groundedness. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations. This study provides novel insights into the groundedness challenges in LFQA and underscores the necessity for more robust mechanisms in LLMs to mitigate the generation of ungrounded content.
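The sentence-level question posed in the abstract (is a sentence supported by the retrieved documents, the pre-training data, both, or neither?) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `grounding_label`, `label_fractions`, and the `is_supported` callable are hypothetical, and `is_supported` stands in for whatever grounding model is used to decide whether a document supports a sentence.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

# Hypothetical interface: returns True if `document` supports `sentence`.
# In the study this role is played by a grounding model; here it is a parameter.
GroundingFn = Callable[[str, str], bool]

LABELS = ("retrieved", "pretraining", "both", "neither")


def grounding_label(
    sentence: str,
    retrieved_docs: Sequence[str],
    pretraining_docs: Sequence[str],
    is_supported: GroundingFn,
) -> str:
    """Assign one of the four groundedness labels to a single sentence."""
    in_retrieved = any(is_supported(sentence, d) for d in retrieved_docs)
    in_pretraining = any(is_supported(sentence, d) for d in pretraining_docs)
    if in_retrieved and in_pretraining:
        return "both"
    if in_retrieved:
        return "retrieved"
    if in_pretraining:
        return "pretraining"
    return "neither"


def label_fractions(
    sentences: Sequence[str],
    retrieved_docs: Sequence[str],
    pretraining_docs: Sequence[str],
    is_supported: GroundingFn,
) -> Dict[str, float]:
    """Fraction of sentences that fall under each groundedness label."""
    counts = Counter(
        grounding_label(s, retrieved_docs, pretraining_docs, is_supported)
        for s in sentences
    )
    total = len(sentences)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: counts[label] / total for label in LABELS}


def toy_supported(sentence: str, document: str) -> bool:
    """Toy stand-in (assumption): case-insensitive substring match, not a real grounding model."""
    return sentence.lower() in document.lower()
```

Passing the predicate in as an argument keeps the bookkeeping independent of the grounding model, so the substring stand-in can be swapped for a learned entailment classifier without touching the labeling logic.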

Paper Structure (39 sections, 3 equations, 9 figures, 4 tables)

Introduction
Background
Hallucination & Factuality.
Setting.
Experimental Procedure
Notation
Measuring Correctness
Measuring Groundedness
Groundedness in the retrieved documents.
Groundedness in the pre-training data.
Groundedness scores.
Experimental Setup
Datasets.
Models.
Grounding.
...and 24 more sections

Figures (9)

Figure 1: Our experimental setup. Using a set of retrieved documents (1), an LLM generates an answer in an LFQA setting (2). Then, the model’s pre-training corpus is searched for documents related to the generation (3). Finally, a grounding model verifies whether the model’s response is supported by any of the considered documents (4).
Figure 2: Groundedness & correctness. Each of the 8 sectors in the chart corresponds to a specific combination of groundedness (in the retrieved documents, pre-training data, both, or neither) and EM correctness (either belonging to $\mathrm{EM}^0$ or $\mathrm{EM}^+$). The area of a sector corresponds to the fraction of all model-generated sentences over all ASQA test examples that exhibit that groundedness-correctness combination.
Figure 3: Groundedness across datasets. The height of each bar represents the fraction of generated sentences that belong to partially correct generations. A significant fraction of these sentences are not grounded in either the retrieved or pre-training documents.
Figure 4: Groundedness by size. As before, the height of each bar represents the fraction of generated sentences that belong to partially correct generations. Increased model size correlates with an increase in the number of sentences in $\mathrm{EM}^+$, but also an increase in groundedness.
Figure 5: Groundedness by decoding. Exact Match (EM) scores plotted against the groundedness threshold, i.e., the minimum fraction of grounded sentences required for a model generation to be considered valid. As the groundedness threshold tightens, the EM scores for random sampling quickly degrade. The scores for beam search, however, remain roughly unaltered, indicating a higher level of groundedness. Results obtained with Pythia 12B on ASQA.
...and 4 more figures
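The thresholded evaluation described in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the paper's code: the function name `thresholded_em` and the input layout (one EM score plus per-sentence grounded flags for each generation) are hypothetical, and the sketch assumes that a generation falling below the threshold contributes an EM of zero, which the caption does not state explicitly.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (EM score of the generation, one grounded flag per sentence).
Generation = Tuple[float, Sequence[bool]]


def thresholded_em(generations: List[Generation], threshold: float) -> float:
    """Mean EM where a generation counts only if its grounded fraction >= threshold.

    Assumption: generations below the threshold are scored as 0 rather than dropped.
    """
    if not generations:
        return 0.0
    total = 0.0
    for em_score, grounded_flags in generations:
        # Fraction of sentences in this generation judged to be grounded.
        if grounded_flags:
            grounded_fraction = sum(grounded_flags) / len(grounded_flags)
        else:
            grounded_fraction = 0.0
        if grounded_fraction >= threshold:
            total += em_score
    return total / len(generations)


# Toy data (made up for illustration, not results from the paper).
example = [
    (1.0, [True, True]),    # fully grounded
    (0.5, [True, False]),   # half grounded
    (1.0, [False, False]),  # ungrounded
]
scores = {t: thresholded_em(example, t) for t in (0.0, 0.5, 1.0)}
```

Sweeping `threshold` from 0 to 1 yields a curve of the kind the caption describes: a decoding strategy whose correct generations are mostly grounded keeps its score as the threshold tightens, while one that relies on ungrounded sentences loses it.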
