Let your LLM generate a few tokens and you will reduce the need for retrieval

Hervé Déjean

TL;DR

The paper addresses the question of when to retrieve in Retrieval-Augmented Generation (RAG) by training an IK ("I Know") classifier that predicts whether the LLM can answer a question without external retrieval. The classifier is distilled from an LLM-as-judge, which produces silver labels on a real QA corpus. Using the IK signal maintains or improves QA performance while cutting retrieval usage by about half on some datasets, especially when the classifier input includes a few tokens of the generated answer. Key findings include IK accuracy of around 80%, data efficiency (about 20k training samples suffice when answer tokens are included), and robustness across several model families and teacher choices. The approach offers practical efficiency gains for RAG and a way to characterize datasets via IK scores, subject to caveats about judge reliability and data quality.

Abstract

In this paper, we investigate how efficiently large language models (LLMs) can be trained to check whether an answer is already stored in their parametric memory. We distill an LLM-as-a-judge to compute the IK (I Know) score. We find that this method is particularly beneficial in the context of retrieval-augmented generation (RAG), where it reaches a respectable accuracy of 80%. It enables a significant reduction (more than 50%) in the number of search and reranking steps required for certain datasets. We also introduce the IK score, which serves as a useful tool for characterising datasets by facilitating the classification task. Interestingly, when response tokens are included as input, our results suggest that only about 20,000 training samples are required to achieve good performance. The central element of this work is the use of a teacher model, the LLM-as-a-judge, to generate training data. We also assess the robustness of the IK classifier by evaluating it with various types of teachers, including both string-based methods and LLMs, with the latter providing better results.
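The gating idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`answer_with_ik_gate`, `generate`, `ik_score`, `retrieve`), the 0.5 threshold, and the 32-token prefix are all hypothetical choices made for illustration, and the toy stubs stand in for a real LLM, a trained IK classifier, and a search-plus-reranking pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GatedAnswer:
    answer: str
    ik_score: float
    used_retrieval: bool


def answer_with_ik_gate(
    question: str,
    generate: Callable[[str, int], str],    # hypothetical: (prompt, max_tokens) -> text
    ik_score: Callable[[str, str], float],  # hypothetical: (question, answer_prefix) -> score in [0, 1]
    retrieve: Callable[[str], List[str]],   # hypothetical: question -> passages
    threshold: float = 0.5,                 # assumed cut-off, not taken from the paper
    prefix_tokens: int = 32,                # assumed size of "a few tokens"
    max_tokens: int = 256,
) -> GatedAnswer:
    # Step 1: let the LLM generate a few answer tokens without any retrieval.
    prefix = generate(question, prefix_tokens)
    # Step 2: score the question together with those tokens.
    score = ik_score(question, prefix)
    # Step 3: a high score means "I know", so search and reranking are skipped.
    if score >= threshold:
        return GatedAnswer(generate(question, max_tokens), score, False)
    # Otherwise fall back to the usual RAG path.
    passages = retrieve(question)
    prompt = "\n\n".join(passages) + "\n\nQuestion: " + question
    return GatedAnswer(generate(prompt, max_tokens), score, True)


# Toy stand-ins so the sketch runs end to end.
def toy_generate(prompt: str, max_tokens: int) -> str:
    return " ".join(("answer to: " + prompt).split()[:max_tokens])


def toy_ik_score(question: str, prefix: str) -> float:
    return 0.9 if "capital of France" in question else 0.1


def toy_retrieve(question: str) -> List[str]:
    return ["(retrieved passage)"]


if __name__ == "__main__":
    for q in ["What is the capital of France?", "Who won the 1987 regional chess open?"]:
        result = answer_with_ik_gate(q, toy_generate, toy_ik_score, toy_retrieve)
        print(q, "->", "retrieval" if result.used_retrieval else "no retrieval")
```

In this sketch the threshold controls the trade-off: raising it sends more questions through retrieval, lowering it saves more search and reranking calls at the risk of answering from parametric memory when the model does not actually know.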
