Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG

Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye

TL;DR

This work tackles delayed retrieval in dynamic retrieval-augmented generation by introducing Entropy-Trend Constraint (ETC), a training-free method that models the dynamics of token-level uncertainty through entropy trends. ETC computes an entropy sequence and uses first- and second-order differences, combined with a dynamic smoothing scheme, to trigger retrieval at earlier and more appropriate times during decoding. Across six QA benchmarks and three LLM backbones, ETC consistently outperforms strong baselines while reducing retrieval frequency and maintaining robustness in domain-specific settings. The approach is plug-and-play and model-agnostic, so it can be added to existing RAG decoding pipelines to make knowledge injection more timely, efficient, and accurate.

Abstract

Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
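
To make the idea of first- and second-order entropy differences concrete, the following is a minimal sketch rather than the authors' implementation. It computes the Shannon entropy of each next-token distribution, takes discrete differences of the resulting sequence, and flags the first position where the second-order difference exceeds a threshold. The function names, the fixed `threshold` parameter, and the simple "exceeds threshold" rule are all assumptions made for illustration; the dynamic smoothing scheme mentioned above is not specified in this summary and is omitted here.

```python
import math
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def differences(values: Sequence[float]) -> List[float]:
    """Discrete first-order difference: values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


def first_trigger(entropies: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first token whose second-order entropy difference
    exceeds `threshold`, or None if no position qualifies.

    Hypothetical trigger rule for illustration only; the paper's exact
    criterion and its dynamic smoothing are not reproduced here.
    """
    first_order = differences(entropies)
    second_order = differences(first_order)
    for i, d2 in enumerate(second_order):
        if d2 > threshold:
            # second_order[i] is computed from entropies[i], [i + 1], [i + 2],
            # so the signal becomes available at token i + 2.
            return i + 2
    return None


if __name__ == "__main__":
    # Made-up entropy values: flat at first, then rising sharply.
    example_entropies = [0.1, 0.1, 0.1, 0.6, 1.5]
    print(first_trigger(example_entropies, threshold=0.3))
```

With the made-up sequence above, the second-order differences are 0.0, 0.5, and 0.4, so the sketch flags token index 3, where the entropy first starts to accelerate, rather than index 4, where it is highest. That contrast is the intuition behind triggering on trends instead of on absolute confidence.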

Paper Structure

This paper contains 24 sections, 12 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The delayed retrieval issue in current dynamic RAG methods. Blue tokens mark DRAGIN's retrieval timing, and red tokens highlight incorrectly generated tokens caused by delayed retrieval.
  • Figure 2: The win rate with GPT-4o as the judge. The value in each bracket is the percentage of cases in which ETC's answer quality is equal to or better than DRAGIN's on the corresponding dataset.
  • Figure 3: The heat-map of retrieval timing and the entropy distribution.
  • Figure 4: Illustrative cases of delayed retrieval. The first two cases demonstrate delayed retrieval, where green tokens indicate ETC's retrieval timing, blue tokens represent DRAGIN's retrieval timing, and red tokens highlight incorrectly generated tokens caused by delayed retrieval. The last two cases illustrate missing retrieval, which is a special case of delayed retrieval.
  • Figure 5: The prompt used to evaluate the answer quality in our paper.