Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria

TL;DR

This work tackles the challenge of evaluating and improving the trustworthiness of LLMs within retrieval-augmented generation (RAG) by introducing Trust-Score, a holistic metric that evaluates groundedness, refusals, and citation quality independent of retriever performance. It identifies limitations of existing evaluation paradigms and proposes Answer Calibration, Grounded Refusals, and Attribution Groundedness as core components. To drive improvements, the authors construct the Trust-Align dataset and train models with Direct Preference Optimization, achieving substantial gains in Trust-Score across 26 of 27 model-family configurations and enhancing refusal and citation-groundedness while maintaining reasonable answer correctness. The approach generalizes across model families and sizes, including open-weight models, and narrows the gap with closed systems like GPT-4, offering a practical pathway to more trustworthy RAG systems.

Abstract

LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for improved Trust-Score performance. 26 out of 27 models aligned using Trust-Align substantially outperform competitive baselines on ASQA, QAMPARI, and ELI5. Specifically, in LLaMA-3-8b, Trust-Align outperforms FRONT on ASQA (up 12.56), QAMPARI (up 36.04), and ELI5 (up 17.69). Trust-Align also significantly enhances models' ability to correctly refuse and provide quality citations. We also demonstrate the effectiveness of Trust-Align across different open-weight models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release our code at https://github.com/declare-lab/trust-align.
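
The TL;DR notes that models are aligned on Trust-Align with Direct Preference Optimization (DPO). As a rough illustration of what that objective computes, the sketch below evaluates the standard DPO loss for a single preference pair (a preferred and a dispreferred response to the same prompt). It is not taken from the authors' code: the function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions chosen for illustration, and the inputs are assumed to be summed token log-probabilities from the policy being trained and from a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All names and the default beta are illustrative assumptions,
    not values taken from the paper.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference has zero margin.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(baseline, improved)
```

In this setting the chosen response would be a grounded answer or appropriate refusal and the rejected response a hard negative, as described in the Figure 2 caption; how those pairs are built is specific to the paper and is not reproduced here.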

Paper Structure

This paper contains 83 sections, 10 equations, 7 figures, 27 tables.

Figures (7)

  • Figure 1: Trust-Score calculation shown as a computational graph.
  • Figure 2: Overview of Trust-Align. Left: Curation of seed and augmented prompts (Q-D pairs), with an example of the answerability labeling process during the retrieval stage. Right: The paired-response data generation process. Positive answers are obtained first, hard negative answers are then selected, and the model is finally aligned via DPO.
  • Figure 3: Document recombination process in augmented prompt curation.
  • Figure 4: Claim-document-mapping process.
  • Figure 5: Statistics of hallucinations from the output of LLaMA-2-7b SFT model prompted using 70K $(q,D)$ samples obtained in Step-2 of Trust-Align.
  • ...and 2 more figures