Adaptive Retrieval helps Reasoning in LLMs -- but mostly if it's not used

Srijan Shakya; Anamaria-Roberta Hartl; Sepp Hochreiter; Korbinian Pöppel

TL;DR

LLMs tend to falter on complex reasoning because their parametric knowledge is static. The paper introduces an adaptive retrieval-augmented reasoning agent that retrieves on demand within a Chain-of-Thought (CoT) framework. It reports that selectively querying an external knowledge base improves performance on challenging math benchmarks (notably +6.4 pp on MATH-500), whereas static, non-targeted retrieval can harm reasoning. Traces in which the agent chooses not to retrieve still perform strongly, which indicates that the retrieval decision carries a meaningful metacognitive signal. The key contribution is showing that retrieval's value lies in when and how it is used, not merely in the retrieved content, and that retrieval frequency scales with problem difficulty. This points to a general principle for building robust generative models: treat retrieval as a dynamic, agentic, metacognitive tool for uncertainty-aware knowledge integration rather than as a fixed augmentation.

Abstract

Large Language Models (LLMs) often falter in complex reasoning tasks due to their static, parametric knowledge, leading to hallucinations and poor performance in specialized domains like mathematics. This work explores a fundamental principle for enhancing generative models: treating retrieval as a form of dynamic in-context learning. We test an adaptive retrieval-augmented architecture where an LLM agent actively decides when to query an external knowledge base during its reasoning process. We compare this adaptive strategy against a standard Chain-of-Thought (CoT) baseline and a static retrieval approach on the GSM8K and MATH-500 benchmarks. Although our experiments show that static retrieval is inferior to CoT, adaptive retrieval shows interesting behavior: while traces that include retrieved results perform slightly worse than CoT, traces that do not include retrieval actually perform better than CoT. This suggests that (a) retrieval only rarely helps reasoning (we show a few counterexamples, e.g., using useful theorems) and (b) actively not using retrieval is indicative of good model performance. Furthermore, we find that the model scales its retrieval frequency with the difficulty of the problem, reinforcing that the decision to retrieve is a crucial metacognitive signal. The agent's ability to self-assess its knowledge and selectively engage with external information represents a key principle for building more robust and reliable generative models.
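
The analysis described in the abstract, splitting adaptive traces by whether they contain a retrieval call and measuring how often retrieval is used at each difficulty level, can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the `Trace` record, its field names, and the toy entries are all hypothetical, and the numbers carry no results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    """Hypothetical record of one adaptive run on one problem."""
    difficulty: int       # assumed per-problem difficulty label
    used_retrieval: bool  # True if the trace contains at least one search call
    correct: bool         # True if the final answer matched the reference


def accuracy_by_retrieval(traces):
    """Return {used_retrieval: accuracy} over the given traces."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t in traces:
        totals[t.used_retrieval] += 1
        hits[t.used_retrieval] += int(t.correct)
    return {k: hits[k] / totals[k] for k in totals}


def retrieval_rate_by_difficulty(traces):
    """Return {difficulty: fraction of traces that used retrieval}."""
    used, totals = defaultdict(int), defaultdict(int)
    for t in traces:
        totals[t.difficulty] += 1
        used[t.difficulty] += int(t.used_retrieval)
    return {d: used[d] / totals[d] for d in sorted(totals)}


# Toy, made-up traces for illustration only (not data from the paper).
toy = [
    Trace(1, False, True),
    Trace(1, False, True),
    Trace(3, True, False),
    Trace(3, False, True),
    Trace(5, True, True),
    Trace(5, True, False),
]
acc = accuracy_by_retrieval(toy)
rate = retrieval_rate_by_difficulty(toy)
```

Comparing `acc[False]` and `acc[True]` against a CoT baseline's accuracy on the same problems is the kind of split the abstract refers to; `rate` corresponds to the claim that retrieval frequency grows with difficulty.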

Paper Structure (28 sections, 1 equation, 4 figures, 5 tables)

Introduction
Method: An Adaptive Retrieval-Augmented Reasoning Agent
Core Language Model
Retrieval Module
Experiments and Results
Reasoning Strategies
Overall Performance
Analysis of the Adaptive Agent's Behavior
Retrieval Decision Analysis
Discussion and Conclusion
Appendix
Experimental Setup and Hyperparameter Configuration
Prompt Templates
Chat Formatting (Llama 3 Template)
LLM with No CoT (System Prompt)
...and 13 more sections

Figures (4)

Figure 1: The dynamic context of the Adaptive Retrieval-CoT agent. Unlike a standard prompt, the context is an evolving transcript. The key difference is the agent's ability to generate a <search> tag, which pauses generation. The system then executes the query and injects the results back into the context, allowing the agent to resume its reasoning with new, targeted information.
Figure 2: Performance comparison of the three reasoning strategies.
Figure 3: Contingency: CoT vs Adaptive Retrieval-CoT on MATH-500. ✔ means the method solved the task; ✘ means it did not. The top line indicates the decision boundary for the better method.
Figure 4: Contingency: CoT vs Adaptive Retrieval-CoT on GSM8K. ✔ means the method solved the task; ✘ means it did not. The top line indicates the decision boundary for the better method.
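
The generate, pause, retrieve, resume loop that Figure 1 describes can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate` and `search` are hypothetical callables standing in for the language model and the retrieval module, the closing `</search>` tag and the `<results>` wrapper are assumed formats, and `max_searches` is an illustrative cap.

```python
import re

# Assumed tag format: the agent emits <search>query</search> to request retrieval.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def adaptive_reason(question, generate, search, max_searches=3):
    """Run the evolving-transcript loop; return (transcript, number of searches)."""
    context = question
    n_searches = 0
    while True:
        chunk = generate(context)  # model continues the transcript
        match = SEARCH_RE.search(chunk)
        if match is None or n_searches >= max_searches:
            # No retrieval requested (or cap reached): treat the chunk as final.
            return context + chunk, n_searches
        # Pause at the search tag: keep text up to the tag, drop anything after it.
        context += chunk[: match.end()]
        results = search(match.group(1).strip())
        # Inject the results so the model can resume with the new information.
        context += f"\n<results>{results}</results>\n"
        n_searches += 1


# Stub model and retriever, for illustration only.
def fake_generate(context):
    if "<results>" not in context:
        return " I need a fact. <search>sum of first n integers</search>"
    return " Using the formula, the answer is 55."


def fake_search(query):
    return "n(n+1)/2"


transcript, n = adaptive_reason("What is 1+...+10?", fake_generate, fake_search)
```

The returned search count is the per-trace quantity needed to separate traces with retrieval from traces without it.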
