TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG

Yufeng Wang, Lu Wei, Haibin Ling

TL;DR

This work tackles the cost-inefficiency of retrieval in RAG by introducing TARG, a training-free, single-shot gate that decides whether to retrieve using uncertainty signals derived from a short, retrieval-free prefix. Among the proposed signals, the top-1/top-2 margin emerges as the most robust under modern instruction-tuned LLMs, with small-$N$ variance offering a conservative alternative for tight budgets. Calibrated via a simple budget-based threshold, TARG achieves substantial reductions in retrieval and latency while maintaining or improving accuracy across NQ-Open, TriviaQA, and PopQA, and it remains compatible with stronger backbones and retrieval architectures. The results underscore that selective retrieval is preferable to unconditional retrieval and that backbone sharpness shapes which uncertainty signal is most informative, providing practical guidance for deploying efficient RAG systems.
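The budget-based calibration mentioned above can be sketched as a simple quantile rule: given uncertainty scores collected on a held-out set of queries, pick the threshold so that at most a target fraction of them would trigger retrieval. This is a minimal sketch under stated assumptions, not the authors' implementation; the names `calibrate_threshold`, `should_retrieve`, `calibration_scores`, and `budget` are hypothetical, and the convention that retrieval fires when the score strictly exceeds the threshold follows the abstract.

```python
import math


def calibrate_threshold(calibration_scores, budget):
    """Return tau such that at most a `budget` fraction of scores exceed it.

    Assumption: higher score means more uncertainty, and retrieval fires
    when score > tau. `budget` is the target retrieval rate in [0, 1].
    """
    if not calibration_scores:
        raise ValueError("need at least one calibration score")
    if not 0.0 <= budget <= 1.0:
        raise ValueError("budget must lie in [0, 1]")
    ranked = sorted(calibration_scores, reverse=True)
    n_retrieve = math.floor(budget * len(ranked))
    if n_retrieve >= len(ranked):
        return -math.inf  # budget of 1.0: every query retrieves
    # At most n_retrieve scores are strictly greater than ranked[n_retrieve],
    # so the realized retrieval rate never exceeds the budget, even with ties.
    return ranked[n_retrieve]


def should_retrieve(score, tau):
    """Single-shot gate: retrieve only when uncertainty exceeds the threshold."""
    return score > tau
```

With ties in the calibration scores the realized retrieval rate can fall below the budget, which errs on the side of fewer retrievals.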

Abstract

Retrieval-Augmented Generation (RAG) improves factuality, but retrieving for every query often hurts quality while inflating tokens and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes one of three lightweight uncertainty scores: mean token entropy, a margin signal derived from the top-1/top-2 logit gap via a monotone link, or small-N variance across a handful of stochastic prefixes. It triggers retrieval only when the score exceeds a threshold. The gate is model-agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, TARG matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and it remains close to Never-RAG in overhead. A central empirical finding is that under modern instruction-tuned LLMs the margin signal is a robust default (entropy compresses as backbones sharpen), with small-N variance offering a conservative, budget-first alternative. We provide ablations over gate type and prefix length and use a delta-latency view to make budget trade-offs explicit.
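The three uncertainty scores named in the abstract can be illustrated with a short, dependency-free sketch. It assumes the prefix logits are available as one list of floats per draft token. The link `exp(-gap)` is only one possible strictly decreasing choice, and the per-draft scalar fed to the variance score is left to the caller, since the abstract does not specify either; all function names here are hypothetical.

```python
import math
import statistics


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_token_entropy(prefix_logits):
    """Average Shannon entropy (in nats) over the draft prefix tokens."""
    entropies = []
    for logits in prefix_logits:
        probs = softmax(logits)
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


def margin_score(prefix_logits, link=lambda gap: math.exp(-gap)):
    """Average of a decreasing link applied to the top-1/top-2 logit gap.

    Assumption: exp(-gap) stands in for the paper's monotone link, so a
    small gap (an ambiguous token) yields a high uncertainty score.
    """
    values = []
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        values.append(link(top1 - top2))
    return sum(values) / len(values)


def small_n_variance(draft_summaries):
    """Population variance of one scalar summary per stochastic draft.

    Assumption: each of the N sampled prefixes is reduced to a single
    number (for example its mean token log-probability) by the caller.
    """
    return statistics.pvariance(draft_summaries)
```

Any of these scores can then be compared against a threshold, with retrieval triggered only when the score exceeds it.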

Paper Structure

This paper contains 35 sections, 4 theorems, 23 equations, 1 figure, 6 tables, and 1 algorithm.

Key Result

Lemma 1

Let $\phi:\mathbb{R}\to\mathbb{R}$ be strictly decreasing. Fix any "shape" vector $\delta=(\delta_1,\dots,\delta_k)\in\mathbb{R}^k$ with $\sum_{t=1}^k \delta_t = 0$, and define the gap sequence along the location family together with the margin-based gate statistic $U_{\mathrm{mar}}(\mu)$. Then $U_{\mathrm{mar}}(\mu)$ is strictly decreasing in $\mu$. Consequently, for any threshold $\tau\in\mathbb{R}$ there exists a (unique) v
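As a sketch of why the monotonicity claim holds, assume (this is a reading of the lemma's wording, not the paper's exact notation) that the location family shifts every gap by a common level $\mu$ and that the gate statistic averages the link $\phi$ over the $k$ prefix positions:

```latex
g_t(\mu) = \mu + \delta_t, \quad t = 1,\dots,k,
\qquad
U_{\mathrm{mar}}(\mu) = \frac{1}{k}\sum_{t=1}^{k} \phi\bigl(g_t(\mu)\bigr).

% For \mu < \mu', every gap satisfies g_t(\mu) < g_t(\mu'), and \phi is
% strictly decreasing, so each summand strictly decreases:
\phi\bigl(g_t(\mu)\bigr) > \phi\bigl(g_t(\mu')\bigr) \ \text{for all } t
\quad\Longrightarrow\quad
U_{\mathrm{mar}}(\mu) > U_{\mathrm{mar}}(\mu').
```

Under this reading the constraint $\sum_{t=1}^k \delta_t = 0$ makes $\mu$ the mean gap, since $\frac{1}{k}\sum_{t=1}^k g_t(\mu) = \mu$, so thresholding $U_{\mathrm{mar}}$ amounts to thresholding the mean gap for a fixed shape $\delta$.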

Figures (1)

  • Figure 1: Illustration of TARG methodology

Theorems & Definitions (9)

  • Lemma 1: Location-equivalence of the margin gate
  • proof
  • Remark: Scope
  • Lemma 2
  • proof
  • Proposition 1: Weak dominance over Never
  • proof
  • Proposition 2: Dominance over Always under one-sided sign
  • proof