SFR-RAG: Towards Contextually Faithful LLMs

Xuan-Phi Nguyen; Shrey Pandit; Senthil Purushwalkam; Austin Xu; Hailin Chen; Yifei Ming; Zixuan Ke; Silvio Savarese; Caiming Xong; Shafiq Joty

SFR-RAG: Towards Contextually Faithful LLMs

Xuan-Phi Nguyen, Shrey Pandit, Senthil Purushwalkam, Austin Xu, Hailin Chen, Yifei Ming, Zixuan Ke, Silvio Savarese, Caiming Xong, Shafiq Joty

TL;DR

The paper targets faithful factual grounding in retrieval augmented generation by proposing SFR-RAG, a 9B instruction-tuned LLM focused on context-grounded generation and minimal hallucination. It introduces ContextualBench, a standardized, reproducible evaluation suite spanning seven contextual QA tasks, to enable fair comparisons. Experimental results show SFR-RAG-9B achieving state-of-the-art on three ContextualBench tasks and outperforming larger open baselines on several others, while maintaining competitiveness on standard benchmarks and function-calling tasks. FaithEval analyses reveal robust fidelity to contextual information even under unknown, conflicting, or counterfactual settings, highlighting practical resilience for real-world RAG deployments.

Abstract

Retrieval Augmented Generation (RAG), a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance, has emerged as a pivotal area in generative AI. The LLMs used in RAG applications are required to faithfully and completely comprehend the provided context and users' questions, avoid hallucination, handle unanswerable, counterfactual or otherwise low-quality and irrelevant contexts, perform complex multi-hop reasoning and produce reliable citations. In this paper, we introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization. We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks, such as HotpotQA and TriviaQA, with consistent RAG settings to ensure reproducibility and consistency in model assessments. Experimental results demonstrate that our SFR-RAG-9B model outperforms leading baselines such as Command-R+ (104B) and GPT-4o, achieving state-of-the-art results in 3 out of 7 benchmarks in ContextualBench with significantly fewer parameters. The model is also shown to be resilient to alteration in the contextual information and behave appropriately when relevant context is removed. Additionally, the SFR-RAG model maintains competitive performance in general instruction-following tasks and function-calling capabilities.

SFR-RAG: Towards Contextually Faithful LLMs

TL;DR

Abstract

Paper Structure (12 sections, 3 figures, 5 tables)

This paper contains 12 sections, 3 figures, 5 tables.

Introduction
SFR-RAG
SFR-RAG Chat Template
SFR-RAG Fine-tuning Process
Evaluation
Contextual Evaluation Suite - ContextualBench
Dataset Specific Settings.
Experimental Results on ContextualBench
Resilience to Unanswerable, Conflicting and Counterfactual Contexts
Standard Benchmarks
Conclusion
Appendix

Figures (3)

Figure 1: Our SFR-RAG-9B model exhibits strong overall performance on an ContextualBench, our comprehensive evaluation suite of seven contextual tasks under a standardized setup. Notably, SFR-RAG achieves state-of-the-art performance on three of seven tasks, with extremely competitive performance on the rest, despite having far fewer parameters than competitive baselines.
Figure 2: Example of the chat format used by SFR-RAG, with additional Thought and Observation turns (roles). The former indicates the model's "inner" thought or reasoning, actions and tool use syntax that are not typically meant to be shown to users. The latter indicates all external information retrieved and returned by performing a search or function call. The Assistant turn, therefore, is relieved to only be responsible to generate user-friendly responses. During training, Thought and Assistant turns are trained while the others are masked out.
Figure 3: FaithEval faitheval: average easy match accuracy scores of different models when contextual facts are fabricated (Counterfactual), removed (Unknown) or when the facts are contradicting (Conflict). Small variations between those settings and overall high absolute scores indicate that SFR-RAG-9B is resilient to changes in contextual information.

SFR-RAG: Towards Contextually Faithful LLMs

TL;DR

Abstract

SFR-RAG: Towards Contextually Faithful LLMs

Authors

TL;DR

Abstract

Table of Contents

Figures (3)