Efficient Test-Time Retrieval Augmented Generation

Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo

TL;DR

ET2RAG tackles the factuality and efficiency problems of large language models by training-free integration of retrieval with consensus. It introduces Stable Organized Retrieval to structure external evidence and Fast Consensus Integration that uses partial generation and a majority-voting mechanism to identify the most reliable answer, reducing computation while maintaining or improving accuracy. The approach yields consistent improvements across open-domain QA, recipe generation, and image captioning, and it analyzes the impact of vote size and response length to reveal favorable efficiency-accuracy trade-offs. This framework offers a practical, scalable path to robust retrieval-augmented generation in diverse multimodal tasks.

Abstract

Although Large Language Models (LLMs) demonstrate significant capabilities, their reliance on parametric knowledge often leads to inaccuracies. Retrieval Augmented Generation (RAG) mitigates this by incorporating external knowledge, but it may introduce irrelevant retrieved documents, leading to inaccurate responses. Integration methods, in turn, filter out incorrect answers from multiple responses, but they lack the external knowledge that RAG methods provide, and their high cost requires balancing overhead against performance gains. To address these issues, we propose an Efficient Test-Time Retrieval-Augmented Generation framework named ET2RAG to improve the performance of LLMs while maintaining efficiency. Specifically, ET2RAG is a training-free method that first retrieves the most relevant documents and augments the LLM to efficiently generate diverse candidate responses by managing response length. We then compute the similarity of candidate responses and employ a majority voting mechanism to select the most suitable response as the final output. In particular, we find that partial generation is sufficient to capture the key information necessary for consensus calculation, allowing us to perform majority voting effectively without fully generated responses. Thus, we can balance computational cost and performance by managing the response length and the number of retrieved documents used for majority voting. Experimental results demonstrate that ET2RAG significantly enhances performance across three tasks: open-domain question answering, recipe generation, and image captioning.
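
The efficiency argument in the abstract can be made concrete with a rough token count. This is a back-of-the-envelope sketch rather than the paper's own cost model: it assumes $V$ candidate responses are each truncated to $L$ tokens for voting, that one full response of $T$ tokens is generated afterwards for the winning candidate, and that prompt-processing cost is ignored. The symbol $T$ and the numbers in the example are hypothetical.

$$
C_{\text{truncated}} = V \cdot L + T, \qquad C_{\text{full}} = V \cdot T, \qquad C_{\text{truncated}} < C_{\text{full}} \iff L < \frac{V-1}{V}\,T .
$$

For instance, with $V = 5$, $L = 16$, and $T = 128$, truncated voting generates $5 \cdot 16 + 128 = 208$ tokens, whereas voting over full responses generates $5 \cdot 128 = 640$ tokens.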

Paper Structure

This paper contains 16 sections, 12 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparison of retrieval and integration strategies. (a) Traditional RAG: Retrieves documents from an external database and feeds them directly into a language model, which may lead to noisy or inconsistent outputs. (b) Self-Consistency Methods: Rely on the stochasticity of LLM decoding to generate multiple reasoning paths and select an answer via majority voting, but lack access to external knowledge. (c) Our ET$^2$RAG: Integrates retrieval augmentation into self-consistency by first organizing retrieved results into diverse and stable subsets. These subsets are used to generate truncated outputs, which are evaluated via consensus to select the most reliable response. Full generation is performed only once on the best retrieval set, significantly reducing computational overhead while enhancing accuracy and robustness.
  • Figure 2: Our proposed ET$^2$RAG consists of two stages: Stable Organized Retrieval and Fast Consensus Integration. In the former stage, the input task query (either text or image) is passed to the Retriever, which retrieves multiple relevant results from the database ($R(x)$). These retrieved results are then organized into independent combinations in the organized set ($S$) through the Organization process. These combinations form the basis for the subsequent generation phase. In the latter stage, the organized combinations are concatenated with the task query and input into the Fast Generation module to generate truncated outputs of fixed length. Next, Consensus Negotiation is applied to calculate the similarity between these outputs, resulting in a similarity matrix ($M$). By summing the elements of this matrix, we compute the Agreement Scores ($A$) for each output. Finally, the output corresponding to the highest agreement score is selected, and its associated combination is used to generate the final complete output.
  • Figure 3: Qualitative results are presented, with responses matching the ground truth in yellow and incorrect outputs in red.
  • Figure 4: Results for TriviaQA, Recipe1M, and CoCo. Panel 1-a shows the accuracy (acc) of ET$^2$RAG (DeepSeek-R1-Llama$_{8B}$) on TriviaQA. Panel 2-a shows the 'Eval Sum' for Recipe1M, the combined score of BLEU, SacreBLEU, and Rouge-L. Panel 3-a shows the 'Eval Sum' for CoCo Captioning, calculated as the sum of BLEU-4, METEOR, and CIDEr scores. The RAG baseline represents results from the traditional RAG method. Details of the metrics are in supplementary material Section 2, A. Panels 1-b, 2-b, and 3-b report the 'Computation Cost', the average number of additional tokens produced per response for varying $L$ and $V$ in each task. Panels 1-c, 2-c, and 3-c depict the Pareto frontiers for each task, which trade off the key metric ('Eval Sum' or accuracy) against 'Computation Cost' as $L$ and $V$ are adjusted. Each task's optimization gives equal priority to maximizing its evaluation metric and minimizing computation cost. Note that panels 1-a, 2-a, and 3-a share legends with panels 1-b, 2-b, and 3-b, respectively.
  • Figure 5: Trade-off between performance and total computation cost on PopQA.
  • ...and 1 more figures
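
The consensus step described in the Figure 2 caption (similarity matrix $M$, agreement scores $A$, selection of the highest-scoring output) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The similarity function is an assumption: a token-level Jaccard overlap stands in for whatever measure the paper actually uses, and the names `jaccard` and `select_by_consensus` are hypothetical. The diagonal of $M$ is set to zero here, which does not change the ranking as long as every output has the same self-similarity.

```python
from typing import Callable, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard overlap (an assumed stand-in similarity measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_by_consensus(
    outputs: List[str],
    sim: Callable[[str, str], float] = jaccard,
) -> Tuple[int, List[float]]:
    """Return the index of the output that agrees most with the others.

    M[i][j] holds the similarity between truncated outputs i and j;
    A[i] is the sum of row i, i.e. the agreement score of output i.
    """
    if not outputs:
        raise ValueError("need at least one candidate output")
    n = len(outputs)
    # Similarity matrix M, with self-similarity excluded (diagonal = 0).
    M = [
        [sim(outputs[i], outputs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Agreement scores A: one row sum per candidate.
    A = [sum(row) for row in M]
    # Ties resolve to the earliest candidate.
    best = max(range(n), key=A.__getitem__)
    return best, A


# Hypothetical truncated outputs, one per organized retrieval combination.
truncated = [
    "The capital is Paris",
    "Paris is the capital",
    "The capital is Lyon",
]
best, scores = select_by_consensus(truncated)
```

In this toy input the first two candidates agree with each other, so one of them wins the vote. In the pipeline the caption describes, the retrieval combination behind `truncated[best]` would then be used for the single full-length generation.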