How Well Do Large Language Models Truly Ground?
Hyunji Lee, Sejune Joo, Chaeeun Kim, Joel Jang, Doyoung Kim, Kyoung-Woon On, Minjoon Seo
TL;DR
This work reframes grounding in large language models as a strict requirement: a response must use all of the necessary knowledge in the provided external context and must stay within its scope, so reliance on parametric knowledge is disallowed. To study this, the authors construct a dataset in four versions (Original-Gold, Original-Dist, Conflict-Gold, Conflict-Dist) and introduce an automatic grounding metric based on atomic facts and a cross-encoder evaluation model. The metric enables fine-grained assessment of how much of the context is used and whether extraneous information is injected. Across 25 LLMs of diverse sizes and training methods, the study finds that the training approach (instruction tuning, RLHF, DPO) tends to influence grounding more than model size does. Grounding is also highly sensitive to distractors and to the placement of the gold context: placing it at the end yields better grounding. A key implication is that high answer accuracy does not guarantee true grounding, which highlights the need for grounding-aware evaluation when deploying knowledge-augmented LLMs. The work provides actionable insights for building more reliable, controllable LLM applications and sets a benchmark for evaluating true grounding in future models.
Abstract
To reduce issues like hallucinations and lack of control in Large Language Models (LLMs), a common approach is to use knowledge-augmented models, which generate responses grounded on external contexts given as input. However, previous research often narrowly defines "grounding" as just having the correct answer, which does not ensure the reliability of the entire response. To overcome this, we propose a stricter definition of grounding: a model is truly grounded if it (1) fully utilizes the necessary knowledge from the provided context, and (2) stays within the limits of that knowledge. We introduce a new dataset and a grounding metric to evaluate model capability under this definition. We perform experiments across 25 LLMs of different sizes and training methods and provide insights into factors that influence grounding performance. Our findings contribute to a better understanding of how to improve grounding capabilities and suggest an area of improvement toward more reliable and controllable LLM applications.
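
The two-part definition above can be illustrated with a minimal sketch. It is not the authors' metric: the token-overlap check is a toy stand-in for the cross-encoder evaluation model, the harmonic-mean combination is an assumption, and every name (`token_overlap_supports`, `grounding_scores`, `utilization`, `containment`) is hypothetical. Atomic facts are assumed to be supplied already split into short sentences.

```python
import re
from typing import Callable, Dict, List


def token_overlap_supports(fact: str, text: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for an entailment model (assumption, not the paper's
    cross-encoder): True if at least `threshold` of the fact's distinct
    word tokens also occur in `text`."""
    fact_tokens = set(re.findall(r"\w+", fact.lower()))
    text_tokens = set(re.findall(r"\w+", text.lower()))
    if not fact_tokens:
        return False
    return len(fact_tokens & text_tokens) / len(fact_tokens) >= threshold


def grounding_scores(
    gold_facts: List[str],
    response_facts: List[str],
    response: str,
    context: str,
    supports: Callable[[str, str], bool] = token_overlap_supports,
) -> Dict[str, float]:
    """Score a response against the two conditions of the definition."""
    # (1) Utilization: share of necessary (gold) facts that the response conveys.
    used = [supports(fact, response) for fact in gold_facts]
    utilization = sum(used) / len(used) if used else 0.0

    # (2) Containment: share of the response's facts that the context supports.
    contained = [supports(fact, context) for fact in response_facts]
    containment = sum(contained) / len(contained) if contained else 0.0

    # Combine with a harmonic mean (an assumption made for this sketch).
    total = utilization + containment
    combined = 2 * utilization * containment / total if total else 0.0
    return {
        "utilization": utilization,
        "containment": containment,
        "combined": combined,
    }


context = "Paris is the capital of France. Paris hosted the 1900 Olympics."
gold_facts = [
    "Paris is the capital of France",
    "Paris hosted the 1900 Olympics",
]
response = "Paris is the capital of France and has 2 million residents."
response_facts = [
    "Paris is the capital of France",
    "Paris has 2 million residents",
]

scores = grounding_scores(gold_facts, response_facts, response, context)
print(scores)
```

In this toy example the response contains the correct capital, yet it omits one necessary fact and adds a claim that the context does not support, so both component scores are 0.5. This mirrors the point that a correct answer alone does not imply true grounding. Swapping `supports` for a real entailment or cross-encoder model would change only that one argument.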
