Modeling Uncertainty Trends for Timely Retrieval in Dynamic RAG
Bo Li, Tian Tian, Zhenghua Xu, Hao Cheng, Shikun Zhang, Wei Ye
TL;DR
This work tackles delayed retrieval in dynamic retrieval-augmented generation by introducing the Entropy-Trend Constraint (ETC), a training-free method that models the dynamics of token-level uncertainty through entropy trends. ETC computes an entropy sequence and uses its first- and second-order differences, combined with a dynamic smoothing scheme, to trigger retrieval earlier and at more appropriate points during decoding. Across six QA benchmarks and three LLM backbones, ETC consistently outperforms strong baselines while reducing retrieval frequency and remaining robust in domain-specific settings. The approach is plug-and-play and model-agnostic, so it can be added to existing decoding pipelines without retraining.
Abstract
Dynamic retrieval-augmented generation (RAG) allows large language models (LLMs) to fetch external knowledge on demand, offering greater adaptability than static RAG. A central challenge in this setting lies in determining the optimal timing for retrieval. Existing methods often trigger retrieval based on low token-level confidence, which may lead to delayed intervention after errors have already propagated. We introduce Entropy-Trend Constraint (ETC), a training-free method that determines optimal retrieval timing by modeling the dynamics of token-level uncertainty. Specifically, ETC utilizes first- and second-order differences of the entropy sequence to detect emerging uncertainty trends, enabling earlier and more precise retrieval. Experiments on six QA benchmarks with three LLM backbones demonstrate that ETC consistently outperforms strong baselines while reducing retrieval frequency. ETC is particularly effective in domain-specific scenarios, exhibiting robust generalization capabilities. Ablation studies and qualitative analyses further confirm that trend-aware uncertainty modeling yields more effective retrieval timing. The method is plug-and-play, model-agnostic, and readily integrable into existing decoding pipelines. Implementation code is included in the supplementary materials.
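To make the idea of trend-aware triggering concrete, the following is a minimal sketch and not the authors' implementation. It computes per-token Shannon entropy, takes first- and second-order differences of the entropy sequence, and flags the first decoding step at which a smoothed second-order difference exceeds a threshold. The function names, the exponential smoothing with a fixed `alpha`, the `threshold` value, and the trigger rule itself are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from typing import List, Optional, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def differences(values: Sequence[float]) -> List[float]:
    """Discrete difference: out[i] = values[i + 1] - values[i]."""
    return [b - a for a, b in zip(values, values[1:])]


def first_trigger_step(
    entropies: Sequence[float],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
    alpha: float = 0.5,      # hypothetical smoothing weight, not taken from the paper
) -> Optional[int]:
    """Return the first decoding step whose smoothed second-order entropy
    difference exceeds `threshold`, or None if no step does.

    Assumed rule: exponential smoothing stands in for the paper's
    dynamic smoothing scheme, whose details are not given in the abstract.
    """
    first_order = differences(entropies)     # entropy change per step
    second_order = differences(first_order)  # acceleration of that change
    smoothed: Optional[float] = None
    for i, d2 in enumerate(second_order):
        smoothed = d2 if smoothed is None else alpha * d2 + (1.0 - alpha) * smoothed
        if smoothed > threshold:
            # second_order[i] is computed from entropies[i], [i + 1], [i + 2]
            return i + 2
    return None
```

In a decoding loop, `token_entropy` would be applied to the model's next-token distribution at each step, and retrieval would be issued when `first_trigger_step` returns a step index. A flat entropy sequence never triggers under this rule, whereas a sudden upward acceleration in entropy does.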
