Let your LLM generate a few tokens and you will reduce the need for retrieval

Hervé Déjean

TL;DR

The paper addresses the question of when to retrieve in Retrieval-Augmented Generation (RAG) by training an IK ("I Know") classifier that predicts whether the LLM can answer without external retrieval. The classifier is trained on silver labels produced by an LLM-as-judge over a real QA corpus, that is, by distilling the judge. Using the IK signal maintains or improves QA performance while halving retrieval usage, especially when the classifier input includes a small number of tokens from the generated answer. Key findings include around 80% IK accuracy, data efficiency (about 20k training samples suffice when answer tokens are included), and robustness across several model families and teacher choices. The approach offers practical gains in RAG efficiency and a way to characterize datasets via IK scores, though it depends on the reliability of the judge and the quality of the data.

Abstract

In this paper, we investigate how efficiently large language models (LLMs) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We found that this method is particularly beneficial in the context of retrieval-augmented generation (RAG), with a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain datasets. We have also introduced the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, through the inclusion of response tokens as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model, the LLM-as-a-judge, to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
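The gating idea described in the abstract can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: the callables `generate`, `ik_score`, and `retrieve`, the 0.5 threshold, and the toy stand-ins are all hypothetical, and the default of 32 draft tokens merely mirrors the answer length mentioned in the figure captions.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: generate(question, docs, max_new_tokens) -> text.
Generate = Callable[[str, List[str], Optional[int]], str]


def answer_with_ik_gate(
    question: str,
    generate: Generate,
    ik_score: Callable[[str, str], float],
    retrieve: Callable[[str], List[str]],
    threshold: float = 0.5,   # assumed cut-off, not taken from the paper
    draft_tokens: int = 32,   # a few answer tokens fed to the classifier
) -> Tuple[str, bool]:
    """Return (answer, retrieval_used)."""
    # 1. Draft the first few answer tokens without any retrieved context.
    draft = generate(question, [], draft_tokens)
    # 2. Ask the IK classifier whether the model already knows the answer.
    if ik_score(question, draft) >= threshold:
        # Confident: answer from parametric memory, skip search and reranking.
        return generate(question, [], None), False
    # 3. Not confident: fall back to the usual RAG path.
    docs = retrieve(question)
    return generate(question, docs, None), True


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_generate(question: str, docs: List[str], max_new_tokens: Optional[int]) -> str:
    text = f"answer to {question!r} using {len(docs)} docs"
    # Characters stand in for tokens in this toy version.
    return text if max_new_tokens is None else text[:max_new_tokens]


def toy_ik_score(question: str, draft: str) -> float:
    return 0.9 if "capital" in question else 0.1


def toy_retrieve(question: str) -> List[str]:
    return ["doc A", "doc B"]


if __name__ == "__main__":
    for q in ("capital of France?", "obscure fact?"):
        print(answer_with_ik_gate(q, toy_generate, toy_ik_score, toy_retrieve))
```

In a real system the retrieval rate would presumably be controlled by the threshold: raising it sends more questions down the RAG path, lowering it skips retrieval more often.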

Paper Structure

This paper contains 14 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 2: Visualisation of the results shown in Table \ref{tab:main}. The red dotted line indicates the LLMEval score for the No RAG scenario, while the yellow dotted line represents the RAG score. The vertical blue dotted line marks a 50% retrieval rate. The x-axis shows the percentage of retrieval conducted. Results are displayed for both 0 and 32 tokens from the answer.
  • Figure 3: Histogram displaying the distribution of the IK score for six datasets utilizing the Mistral-32 model.
  • Figure 4: Evaluation conducted on six datasets, with the IK model trained on the NQ dataset. The red dotted line corresponds to the No RAG result, the orange one to the RAG result. See also Table \ref{tab:datasets}.
  • Figure 5: Impact of the response length used during the IK task: a minimum number of tokens (32) is required, but the plateau is reached quickly, and the gain between 32 and 128 tokens is marginal. See Table \ref{tab:length}.
  • Figure 6: Visualization of the effect of varying the number of training samples. A fairly small quantity (5k) already yields decent performance if 32 tokens from the answer are used. See also Table \ref{tab:training}.