TARG: Training-Free Adaptive Retrieval Gating for Efficient RAG
Yufeng Wang, Lu Wei, Haibin Ling
TL;DR
This work tackles the cost-inefficiency of retrieval in RAG by introducing TARG, a training-free, single-shot gate that decides whether to retrieve using uncertainty signals derived from a short, retrieval-free prefix. Among the proposed signals, the top-1/top-2 margin emerges as the most robust under modern instruction-tuned LLMs, with small-$N$ variance offering a conservative alternative for tight budgets. Calibrated via a simple budget-based threshold, TARG achieves substantial reductions in retrieval and latency while maintaining or improving accuracy across NQ-Open, TriviaQA, and PopQA, and it remains compatible with stronger backbones and retrieval architectures. The results underscore that selective retrieval is preferable to unconditional retrieval and that backbone sharpness shapes which uncertainty signal is most informative, providing practical guidance for deploying efficient RAG systems.
Abstract
Retrieval-Augmented Generation (RAG) improves factuality, but retrieving for every query often hurts quality while inflating token counts and latency. We propose Training-free Adaptive Retrieval Gating (TARG), a single-shot policy that decides when to retrieve using only a short, no-context draft from the base model. From the draft's prefix logits, TARG computes a lightweight uncertainty score: mean token entropy; a margin signal derived from the top-1/top-2 logit gap via a monotone link; or small-N variance across a handful of stochastic prefixes. It triggers retrieval only when the score exceeds a threshold. The gate is model-agnostic, adds only tens to hundreds of draft tokens, and requires no additional training or auxiliary heads. On NQ-Open, TriviaQA, and PopQA, TARG consistently shifts the accuracy-efficiency frontier: compared with Always-RAG, it matches or improves EM/F1 while reducing retrieval by 70-90% and cutting end-to-end latency, and its overhead remains close to that of Never-RAG. A central empirical finding is that, with modern instruction-tuned LLMs, the margin signal is a robust default (entropy compresses as backbones sharpen), while small-N variance offers a conservative, budget-first alternative. We provide ablations over gate type and prefix length, and use a delta-latency view to make budget trade-offs explicit.
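To make the gating idea concrete, the following is a minimal sketch of the entropy and margin scores and of a budget-based threshold, not the authors' implementation. Several details are assumptions made for illustration: the function names, the choice of exp(-gap) as the monotone link (the abstract does not specify one), and the calibration rule that picks the threshold so that roughly a target fraction of calibration queries trigger retrieval. The small-N variance score is omitted because the abstract does not say which quantity the variance is taken over.

```python
import math
from typing import List, Sequence

Logits = Sequence[float]  # one logit vector per draft token


def softmax(logits: Logits) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def mean_entropy(prefix_logits: Sequence[Logits]) -> float:
    """Mean token entropy (in nats) over the draft prefix."""
    total = 0.0
    for logits in prefix_logits:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(prefix_logits)


def margin_score(prefix_logits: Sequence[Logits]) -> float:
    """Mean of exp(-(top1 - top2)) over the prefix.

    ASSUMPTION: exp(-gap) is just one decreasing link that maps a small
    logit gap (an uncertain token) to a score near 1.
    """
    total = 0.0
    for logits in prefix_logits:
        top1, top2 = sorted(logits, reverse=True)[:2]
        total += math.exp(-(top1 - top2))
    return total / len(prefix_logits)


def budget_threshold(calib_scores: Sequence[float], budget: float) -> float:
    """Threshold such that about `budget` of calibration scores exceed it.

    ASSUMPTION: a plain rank-based rule; ties may shift the realized rate.
    """
    ranked = sorted(calib_scores, reverse=True)
    k = math.floor(budget * len(ranked) + 1e-9)
    if k <= 0:
        return math.inf   # never retrieve
    if k >= len(ranked):
        return -math.inf  # always retrieve
    return ranked[k]      # the k highest scores lie strictly above this


def should_retrieve(score: float, threshold: float) -> bool:
    """Single-shot gate: retrieve only when the score exceeds the threshold."""
    return score > threshold
```

In use, one would score a held-out set of queries once, call `budget_threshold` with the desired retrieval rate, and then apply `should_retrieve` to each new query's draft score before deciding whether to invoke the retriever.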
