RbFT: Robust Fine-tuning for Retrieval-Augmented Generation against Retrieval Defects

Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

TL;DR

RbFT addresses the vulnerability of retrieval-augmented generation to defective retrieval results by training LLMs to detect defective documents and extract useful information from flawed inputs. It introduces two tasks, Defects Detection and Utility Extraction, and fine-tunes the model on them via LoRA to preserve efficiency. Across Natural Questions, HotpotQA, and TriviaQA, RbFT consistently outperforms Vanilla RAG and other baselines under varying defect levels, with notable gains on counterfactual content while keeping inference speed comparable. The work suggests a practical path to deploying robust, efficient RAG in real-world settings and points to broader applications beyond QA tasks.

Abstract

Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge retrieved from a knowledge base. However, its effectiveness is fundamentally constrained by the reliability of both the retriever and the knowledge base. In real-world scenarios, imperfections in these components often lead to the retrieval of noisy, irrelevant, or misleading counterfactual information, ultimately undermining the trustworthiness of RAG systems. To address this challenge, we propose Robust Fine-Tuning (RbFT), a method designed to enhance the resilience of LLMs against retrieval defects through two targeted fine-tuning tasks. Experimental results demonstrate that RbFT significantly improves the robustness of RAG systems across diverse retrieval conditions, surpassing existing methods while maintaining high inference efficiency and compatibility with other robustness techniques.

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of our RbFT. Specifically, RbFT consists of two sub-tasks: Defects Detection and Utility Extraction, which aim to identify the types of retrieval defects and generate the final answer with limited useful information, respectively. In the figure, green text indicates relevant information, while red text represents incorrect counterfactual information.
  • Figure 2: Empirical study: the impact of different types of retrieval defects on Vanilla RAG. The average EM metric on NQ, HQA, and TQA datasets is reported.
  • Figure 3: The effectiveness-robustness trade-off scatter diagram. The x-axis represents effectiveness measured by the EM scores of each model in the Clean setting, and the y-axis represents robustness measured by the EM scores of each model in the Hard + Mix setting.
  • Figure 4: The EM performance of all methods under 4 types of defective data with $\tau = \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$.
  • Figure 5: Case studies on the attention distribution over input documents of Vanilla RAG and RbFT under different retrieval defects. The greener a document token, the higher the attention it receives during the answer generation process.
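
Several of the captions above report the EM (exact match) metric. As a rough illustration of how such a score is commonly computed in open-domain QA, the sketch below normalizes the prediction and the gold answers and checks for equality. The normalization steps (lowercasing, removing punctuation and English articles) follow a widely used convention and are an assumption here; this page does not specify the paper's exact EM variant, and the function names are hypothetical.

```python
import re
import string
from typing import List, Sequence


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed normalization (a common QA convention), not taken from the paper.
    """
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> int:
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gold) for gold in gold_answers))


def mean_em(predictions: Sequence[str], references: Sequence[List[str]]) -> float:
    """Average EM over a dataset; returns 0.0 for an empty input."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Averaging `mean_em` over the three datasets would give a number comparable in kind to the "average EM" in Figure 2, under the stated assumptions about normalization.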