FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
TL;DR
FaithEval addresses whether language models stay faithful to retrieved or provided context in retrieval-augmented settings. It presents a comprehensive benchmark of 4.9K contextual QA samples spanning unanswerable, inconsistent, and counterfactual contexts, built with a four-stage context construction and validation pipeline that combines LLM-based auto-evaluation with human annotation. An evaluation across 18 models shows that staying faithful to context is hard even for state-of-the-art systems and that larger model size does not guarantee improvement; prompting strategies such as chain-of-thought help but do not close the gap. The work offers insights into how context quality and prompt design affect groundedness, and it calls for holistic evaluation frameworks to drive the development of more faithful contextual LLMs for real-world applications.
Abstract
Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework, employing both LLM-based auto-evaluation and human validation. Our extensive study across a wide range of open-source and proprietary models reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness. The project is available at https://github.com/SalesforceAIResearch/FaithEval.
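To make the three task types concrete, the following is a minimal sketch of how a response might be checked for faithfulness under each one. Everything in it is an illustrative assumption rather than the benchmark's actual data format or metric: the `Sample` fields, the `is_faithful` function, and the keyword rules (expecting "unknown" for unanswerable contexts and "conflict" for inconsistent ones) are hypothetical, and the marshmallow example is borrowed from the paper's title.

```python
from dataclasses import dataclass

# Hypothetical task labels mirroring the three context types named in the abstract.
TASKS = ("unanswerable", "inconsistent", "counterfactual")


@dataclass
class Sample:
    """Illustrative record; not the benchmark's real schema."""
    task: str             # one of TASKS
    context: str          # passage given to the model
    question: str
    context_answer: str   # answer supported by the context ("" if none)


def is_faithful(sample: Sample, response: str) -> bool:
    """Assumed keyword-based check, not the official FaithEval metric."""
    text = response.lower()
    if sample.task == "unanswerable":
        # Assumption: a faithful model admits the context has no answer.
        return "unknown" in text
    if sample.task == "inconsistent":
        # Assumption: a faithful model flags the contradiction.
        return "conflict" in text
    if sample.task == "counterfactual":
        # Assumption: a faithful model follows the context, even if it
        # contradicts world knowledge.
        return sample.context_answer.lower() in text
    raise ValueError(f"unknown task: {sample.task!r}")


if __name__ == "__main__":
    moon = Sample(
        task="counterfactual",
        context="The Moon is made of marshmallows.",
        question="What is the Moon made of?",
        context_answer="marshmallows",
    )
    print(is_faithful(moon, "According to the context, marshmallows."))  # True
    print(is_faithful(moon, "Rock and dust."))                           # False
```

A substring check of this kind is deliberately crude; it only illustrates that the expected behavior differs per task, with abstention, conflict detection, and context-following each counting as the faithful response in one setting.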
