Dynamic Uncertainty Ranking: Enhancing Retrieval-Augmented In-Context Learning for Long-Tail Knowledge in LLMs
Shuyang Yu, Runxue Bao, Parminder Bhatia, Taha Kass-Hout, Jiayu Zhou, Cao Xiao
TL;DR
The paper tackles the instability of retrieval-augmented ICL on long-tail knowledge by introducing a reinforcement-learning-based dynamic uncertainty ranking, which reorders retrieved samples according to the impact each one has on the LLM's prediction. A learnable budget threshold $\sigma$ reduces query cost by updating the retriever only selectively, while a policy-gradient objective trains the retriever to promote informative samples and demote misleading ones. Across five QA datasets with GPT-4, the method outperforms the best baseline by $2.76\%$ on average, with a $5.96\%$ accuracy gain on long-tail questions that zero-shot inference fails to answer. The approach is also query-efficient and transfers across settings, which makes it a practical option for cost-conscious, cross-domain retrieval-augmented ICL.
Abstract
Large language models (LLMs) can learn vast amounts of knowledge from diverse domains during pre-training. However, long-tail knowledge from specialized domains is often scarce and underrepresented in the pre-training data, so the models rarely memorize it. Prior work has shown that in-context learning (ICL) with retriever augmentation can help LLMs better capture long-tail knowledge, reducing their reliance on pre-trained data. Despite these advances, we observe that LLM predictions for long-tail questions remain uncertain: they can change when the retrieved samples vary. To exploit this uncertainty and guide LLM predictions toward correct answers on long-tail samples, we propose a reinforcement learning-based dynamic uncertainty ranking method for ICL that accounts for the varying impact of each retrieved sample on LLM predictions. Our approach prioritizes more informative and stable samples while demoting misleading ones, updating the ranking based on the LLM's feedback for each retrieved sample. To improve training efficiency and reduce query costs, we introduce a learnable dynamic ranking threshold, which is adjusted whenever the model encounters a negative prediction shift. Experimental results on question-answering datasets from different domains show that our method outperforms the best baseline by $2.76\%$, with a notable $5.96\%$ boost in accuracy on long-tail questions that elude zero-shot inference.
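To make the ranking idea concrete, the following is a minimal sketch of a REINFORCE-style update over retrieval scores, not the authors' implementation. It assumes a reward of $+1$ when a retrieved sample, used alone as the in-context example, leads to a correct answer and $-1$ otherwise; the abstract does not specify the reward, so this is an illustrative choice. The names `reinforce_step`, `ranking`, and `toy_llm_feedback` are hypothetical, the LLM call is replaced by a fixed toy function, and the learnable ranking threshold is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rewards, lr=0.5):
    """One policy-gradient (REINFORCE-style) update of the candidate scores.

    rewards[i] is the feedback for candidate i, or None if that candidate
    was not queried. The gradient of log softmax(logits)[i] with respect to
    logits[j] is (1 if i == j else 0) - probs[j].
    """
    probs = softmax(logits)
    grads = [0.0] * len(logits)
    for i, r in enumerate(rewards):
        if r is None:
            continue
        for j in range(len(logits)):
            indicator = 1.0 if i == j else 0.0
            grads[j] += r * (indicator - probs[j])
    return [x + lr * g for x, g in zip(logits, grads)]


def ranking(logits):
    """Candidate indices sorted from highest to lowest score."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)


def toy_llm_feedback(candidate_id):
    """Stand-in for an LLM query (assumption): only candidate 2 is helpful."""
    return 1.0 if candidate_id == 2 else -1.0


# Candidate 2 starts at the bottom of the ranking.
logits = [1.0, 0.5, 0.0]
for _ in range(20):
    rewards = [toy_llm_feedback(i) for i in range(len(logits))]
    logits = reinforce_step(logits, rewards)

final_ranking = ranking(logits)
print(final_ranking)
```

In this toy setting the helpful candidate is promoted to the top of the ranking and the misleading ones are pushed down. A real system would replace `toy_llm_feedback` with actual LLM queries and would update a parametric retriever rather than a raw score list.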
