How Well Do Large Language Models Truly Ground?

Hyunji Lee; Sejune Joo; Chaeeun Kim; Joel Jang; Doyoung Kim; Kyoung-Woon On; Minjoon Seo

TL;DR

This work reframes grounding in large language models as a strict requirement: outputs must fully leverage the provided external context and remain within its scope, disallowing reliance on parametric knowledge. To study this, the authors construct a four-version dataset (Original-Gold, Original-Dist, Conflict-Gold, Conflict-Dist) and introduce an automatic grounding metric based on atomic facts and a cross-encoder evaluation model, enabling fine-grained assessment of how much of the context is used and whether extraneous information is injected. Across 25 LLMs with diverse sizes and training methods, the study finds that training approaches (instruction tuning, RLHF, DPO) tend to influence grounding more than model size, and that grounding is highly sensitive to distractors and the placement of gold context, with end-positioning of gold facts yielding better grounding. A key implication is that high answer accuracy does not guarantee true grounding, highlighting the need for grounding-aware evaluation when deploying knowledge-augmented LLMs. The work provides actionable insights for building more reliable, controllable LLM applications and sets a benchmark for evaluating true grounding in future models.

Abstract

To reduce issues like hallucinations and lack of control in Large Language Models (LLMs), a common method is to generate responses by grounding on external contexts given as input, known as knowledge-augmented models. However, previous research often narrowly defines "grounding" as just having the correct answer, which does not ensure the reliability of the entire response. To overcome this, we propose a stricter definition of grounding: a model is truly grounded if it (1) fully utilizes the necessary knowledge from the provided context, and (2) stays within the limits of that knowledge. We introduce a new dataset and a grounding metric to evaluate model capability under the definition. We perform experiments across 25 LLMs of different sizes and training methods and provide insights into factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications.
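The two-part definition above can be sketched as a pair of ratios over atomic facts: how many of the required (gold) facts the response covers, and how many of the response's own facts are supported by the context. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation. The names `naive_supports`, `GroundingScore`, and `grounding_score` are hypothetical, the substring check is only a stand-in for a learned evaluation model such as the cross-encoder mentioned in the TL;DR, combining the two ratios with a harmonic mean is an assumption, and the resume data is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def naive_supports(text: str, fact: str) -> bool:
    """Stand-in support check (assumption): case-insensitive substring match.

    A realistic setup would replace this with a learned evaluation model.
    """
    return fact.lower() in text.lower()


@dataclass
class GroundingScore:
    recall: float     # condition (1): share of gold facts covered by the response
    precision: float  # condition (2): share of response facts supported by the context
    f1: float         # harmonic mean of the two (an assumed way to combine them)


def grounding_score(
    gold_facts: Sequence[str],
    response_facts: Sequence[str],
    context: str,
    response: str,
    supports: Callable[[str, str], bool] = naive_supports,
) -> GroundingScore:
    """Score a response against both conditions of the strict grounding definition."""
    covered = sum(supports(response, fact) for fact in gold_facts)
    in_scope = sum(supports(context, fact) for fact in response_facts)
    recall = covered / len(gold_facts) if gold_facts else 0.0
    precision = in_scope / len(response_facts) if response_facts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return GroundingScore(recall, precision, f1)


# Invented example data, loosely following the resume scenario of Figure 1.
context = "Alice has 5 years of Python experience. Alice led a team of four engineers."
gold = ["5 years of Python experience", "led a team of four engineers"]

# A response that omits a required fact: stays in scope but is incomplete.
missing = grounding_score(
    gold,
    ["5 years of Python experience"],
    context,
    "Alice has 5 years of Python experience.",
)

# A response that adds a fact absent from the context: complete but out of scope.
extra = grounding_score(
    gold,
    gold + ["studied at a top university"],
    context,
    "Alice has 5 years of Python experience, led a team of four engineers, "
    "and studied at a top university.",
)

print(missing)
print(extra)
```

In this sketch the first response is penalized on recall only and the second on precision only, which is the distinction an answer-accuracy check alone cannot make.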

Paper Structure (59 sections, 18 figures, 14 tables)

Introduction
Related Works
Question Answering
Generating Response with External Knowledge
Definition of Grounding
Grounding
Definition & Usage
Dataset Construction
Step 1: Context Selection
Step 2: Instance Generation & Classification
Step 3: Gold Atomic Fact Selection
Step 4: Modify Context
Step 5: Add Distractor Contexts
Metric
Grounding Performance
...and 44 more sections

Figures (18)

Figure 1: An example scenario in which a company's HR team uses an LLM to ask questions about a candidate's resume, which is given as the input context. The previous definition of grounding would consider responses 1 and 2 well grounded because both are highly relevant to the question and the input context. However, because our definition considers all knowledge in a fine-grained manner, only response 3 counts as well grounded. Response 1 misses a key resume detail (2), which undersells the candidate. Response 2 introduces knowledge (a) that comes from the model's parametric knowledge rather than the given context; this inaccurately overrates the candidate and unfairly influences comparison with others.
Figure 2: Four versions of our dataset: Original-Gold, Original-Dist, Conflict-Gold, and Conflict-Dist. Conflict-* contains gold contexts modified by human annotators (conflict contexts). *-Dist differs from *-Gold in that it also contains distractor contexts. The left part of the figure shows the three key factors we considered when constructing the dataset.
Figure 3: (a) shows grounding performance by model size in Original-Gold. Performance tends to depend more on how the model was tuned than on its size. (b) shows RULES performance against grounding performance. There is a weak correlation between instruction-following ability and grounding performance. (c) shows grounding performance broken down by the characteristics of queries and contexts in Original-Gold. Llama2 and Vicuna are 13B models; Falcon is a 40B model.
Figure 4: Grounding performance of Vicuna-13B-16k as the length of the input contexts increases.
Figure 5: Reduction rate in performance from Original-Gold to Original-Dist. Models with the same base model are shown in the same color. Models that are instruction tuned (falcon_I, GPT_I, Vicuna) or underwent RLHF (Llama2_C) show greater degradation when distractor contexts are added. Vicuna and Llama2 are 13B models; Falcon is a 40B model.
...and 13 more figures
