FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

TL;DR

FaithEval tackles the problem of models remaining faithful to retrieved or provided context in retrieval-augmented settings. It presents a comprehensive benchmark with 4.9K contextual QA samples across unanswerable, inconsistent, and counterfactual contexts, built via a four-stage context construction and validation pipeline that combines LLM-based auto-evaluation with human annotation. An evaluation across 18 models shows that faithfulness to context is hard even for state-of-the-art systems, and larger model sizes do not guarantee improvements; prompting strategies like chain-of-thought help but do not close the gap. The work delivers insights into how context quality and prompt design affect groundedness, and it calls for holistic evaluation frameworks to drive the development of more faithful contextual LLMs for real-world applications.

Abstract

Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework, employing both LLM-based auto-evaluation and human validation. Our extensive study across a wide range of open-source and proprietary models reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness. The project is available at: https://github.com/SalesforceAIResearch/FaithEval.

Paper Structure (30 sections, 21 figures, 8 tables)

Figures (21)

  • Figure 1: Performance summary on FaithEval Benchmark. Each bar shows the combined accuracy (normalized) for the best model from each organization across three tasks: Counterfactual, Inconsistent, and Unanswerable (see the task setup section). Different colors in each bar represent the accuracy for each task.
  • Figure 2: Demonstration of each task in FaithEval. Left: in Unanswerable Context, the context does not contain the answer to the question. Middle: in Inconsistent Context, multiple answers are supported by different documents. Right: in Counterfactual Context, the context contains counterfactual statements that contradict common sense or world knowledge. Complete contexts are given in the appendix.
  • Figure 3: Illustration of the task construction and validation framework. (1) Context Generation: given a source QA dataset, we prompt an LLM to generate a new context based on a question, the original answer, and optionally the original context. (2) Task Construction: we construct the prompt for each sample by combining the original question, the new context, and task-specific instructions. (3) Auto Eval by LLM Judge: we validate the quality of the new context by checking if and only if the new answer is supported by the new context. (4) Human Annotation: we further filter out invalid contextual QA pairs based on the majority vote results from professional annotators.
  • Figure 4: Model performance comparison on the Unanswerable Context task, where no evidence supports the answer. Columns are sorted by performance on the Original task (original context). Proprietary model names are highlighted in orange.
  • Figure 5: Model performance comparison on the Inconsistent Context task. Columns are sorted by the performance on the original task. Proprietary models are colored in orange.
  • ...and 16 more figures
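
The four-stage framework described in the Figure 3 caption can be illustrated with a minimal Python sketch of steps 2 to 4. This is not the authors' implementation: the `Sample` fields, the prompt template, and the function names are hypothetical. The LLM judge verdicts and annotator labels are assumed to be precomputed booleans, and the majority rule is assumed to be a strict majority.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record for one contextual QA pair.
    question: str
    new_context: str
    new_answer: str


def build_prompt(sample: Sample, task_instruction: str) -> str:
    """Step 2 (assumed template): combine instructions, new context, and question."""
    return (
        f"{task_instruction}\n\n"
        f"Context: {sample.new_context}\n"
        f"Question: {sample.question}\n"
        "Answer:"
    )


def majority_vote(labels: list[bool]) -> bool:
    """Step 4 (assumed rule): valid only if a strict majority of annotators agree."""
    return sum(labels) * 2 > len(labels)


def filter_samples(samples, judge_ok, human_labels):
    """Keep samples that pass the LLM judge (step 3) and the human vote (step 4)."""
    kept = []
    for sample, ok, labels in zip(samples, judge_ok, human_labels):
        if ok and majority_vote(labels):
            kept.append(sample)
    return kept
```

For example, a sample that the judge accepts but that only one of three annotators marks as valid would be dropped, as would a sample that the judge rejects regardless of the human labels.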