QE-RAG: A Robust Retrieval-Augmented Generation Benchmark for Query Entry Errors

Kepu Zhang, Zhongxiang Sun, Weijie Yu, Xiaoxue Zang, Kai Zheng, Yang Song, Han Li, Jun Xu

TL;DR

QE-RAG identifies a critical robustness gap in retrieval-augmented generation under realistic query-entry errors and introduces a benchmark built by injecting keyboard proximity, visual similarity, and spelling errors into six QA datasets at two noise levels. It demonstrates that state-of-the-art RAG methods are fragile to corrupted queries, and proposes two solutions: a contrastive-learning-based robust retriever and a retrieval-augmented query-correction approach. Empirical results show substantial robustness improvements and compatibility with existing RAG pipelines across in-domain and cross-domain settings. The work provides a practical framework for evaluating and improving RAG systems in real-world user interactions.

Abstract

Retrieval-augmented generation (RAG) has become a widely adopted approach for enhancing the factual accuracy of large language models (LLMs). While current benchmarks evaluate RAG methods from various perspectives, they share a common assumption that the user queries used for retrieval are error-free. However, in real-world interactions between users and LLMs, query entry errors such as keyboard proximity errors, visual similarity errors, and spelling errors are frequent. The impact of such errors on current RAG methods remains largely unexplored. To bridge this gap, we propose QE-RAG, the first robust RAG benchmark designed specifically to evaluate performance against query entry errors. We augment six widely used datasets by injecting three common types of query entry errors into randomly selected user queries at rates of 20% and 40%, simulating typical user behavior in real-world scenarios. We analyze the impact of these errors on LLM outputs and find that corrupted queries degrade model performance, and that this degradation can be mitigated through query correction and by training a robust retriever to retrieve relevant documents. Based on these insights, we propose a contrastive learning-based robust retriever training method and a retrieval-augmented query correction method. Extensive in-domain and cross-domain experiments reveal that: (1) state-of-the-art RAG methods, including sequential, branching, and iterative methods, exhibit poor robustness to query entry errors; (2) our method significantly enhances the robustness of RAG when handling query entry errors and is compatible with existing RAG methods, further improving their robustness.
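
The corruption step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency map, the look-alike table, the adjacent-character swap used as a stand-in for a spelling error, and all function names are hypothetical. It also assumes that the 20% and 40% rates refer to the fraction of queries that receive an error.

```python
import random

# Hypothetical, partial QWERTY adjacency map (illustrative only).
KEYBOARD_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "r": "edft", "s": "awedxz", "t": "rfgy",
}

# Hypothetical look-alike substitutions (illustrative only).
VISUAL_LOOKALIKES = {"o": "0", "l": "1", "i": "l", "s": "5", "b": "6"}


def _substitute(word, table, rng):
    """Replace one character of `word` that has an entry in `table`."""
    positions = [i for i, ch in enumerate(word) if ch in table]
    if not positions:
        return word  # nothing to corrupt in this word
    i = rng.choice(positions)
    return word[:i] + rng.choice(table[word[i]]) + word[i + 1:]


def keyboard_error(word, rng):
    """Swap one character for a neighboring key."""
    return _substitute(word, KEYBOARD_NEIGHBORS, rng)


def visual_error(word, rng):
    """Swap one character for a visually similar one."""
    return _substitute(word, VISUAL_LOOKALIKES, rng)


def spelling_error(word, rng):
    """Assumed stand-in for a misspelling: transpose two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def corrupt_query(query, rng):
    """Apply one randomly chosen error type to one randomly chosen word."""
    words = query.split()
    if not words:
        return query
    i = rng.randrange(len(words))
    error_fn = rng.choice([keyboard_error, visual_error, spelling_error])
    words[i] = error_fn(words[i], rng)
    return " ".join(words)


def corrupt_dataset(queries, error_rate, seed=0):
    """Corrupt a random `error_rate` fraction of queries.

    Returns a list of (query, was_selected) pairs in the original order.
    """
    rng = random.Random(seed)
    n_corrupt = round(error_rate * len(queries))
    selected = set(rng.sample(range(len(queries)), n_corrupt))
    return [
        (corrupt_query(q, rng), True) if i in selected else (q, False)
        for i, q in enumerate(queries)
    ]
```

Calling `corrupt_dataset(queries, 0.2)` and `corrupt_dataset(queries, 0.4)` on the same query list would give two noise levels under these assumptions. Selected queries can occasionally come back unchanged, for example when the chosen word has no character in the lookup table, so a real pipeline would likely retry or verify that the query actually changed.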

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Examples of three types of query entry errors including keyboard proximity errors, visual similarity errors, and spelling errors.
  • Figure 2: The construction process of QE-RAG datasets. (a) Selection of RAG Dataset. (b) Query corruption through three scenarios: keyboard proximity errors, visual similarity errors and spelling errors. (c) Label matching.
  • Figure 3: Preliminary experiments to explore the impact of query entry errors on RAG performance, where the retriever is BGE, with error ratios of 40% and 20%.
  • Figure 4: Robustness comparison of the average F$_1$ score for correct and corrupted queries when the method is Standard RAG or RA-QCG, the generative model is Llama3, and the error rate is 20%. Values above and below the x-axis represent the average token-level F$_1$ of the correct and corrupted queries, respectively.
  • Figure 5: The results of Standard RAG (RAG), Direct-Correct (DC) and RA-QCG retrieving varying numbers of documents on six datasets when the LLM is Llama3 and the error rate is 20%. The x-axis represents the number of retrieved documents, specifically 1, 3, 5, and 15, while the y-axis indicates the token-level F$_1$ score.
  • ...and 1 more figures
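
The captions for Figures 4 and 5 report a token-level F$_1$ score. As a rough illustration, the sketch below follows the common bag-of-tokens formulation used in extractive QA evaluation. The normalization here (lowercasing and whitespace splitting only) is an assumption, and the paper's exact preprocessing may differ.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between a predicted answer and a reference answer.

    Assumed normalization: lowercase and split on whitespace only.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, a prediction of "the city of Paris" against the reference "Paris" shares one token, giving a precision of 1/4, a recall of 1, and an F$_1$ of 0.4.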