Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Keivan Alizadeh; Parshin Shojaee; Minsik Cho; Mehrdad Farajtabar

Abstract

Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information spread across long contexts. Recent work such as Recursive Language Models (RLM) addresses this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM depends critically on how these context-interaction programs are selected, a question that has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware self-reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM: a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
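To make the selection step concrete, the following is a minimal Python sketch of how three such signals could be used to choose among candidate programs. It is an illustration under stated assumptions, not the paper's algorithm: the abstract names the signals but does not specify how they are combined. The `Candidate` record, the `select_candidate` function, the majority-vote filter, and the rule that higher confidence wins with shorter reasoning traces breaking ties are all hypothetical choices made for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled context-interaction program and its outcome (hypothetical record)."""
    answer: str            # final answer produced by running the program
    confidence: float      # verbalized confidence in [0, 1], self-reported by the model
    reasoning_tokens: int  # length of the reasoning trace, in tokens


def select_candidate(candidates: list[Candidate]) -> Candidate:
    """Pick one candidate using the three signals (assumed combination rule)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Self-consistency: keep only candidates whose answer is among the most frequent.
    counts = Counter(c.answer for c in candidates)
    top = max(counts.values())
    majority = [c for c in candidates if counts[c.answer] == top]
    # Assumption: higher verbalized confidence is preferred, and among equally
    # confident candidates the shorter reasoning trace is preferred.
    return max(majority, key=lambda c: (c.confidence, -c.reasoning_tokens))


if __name__ == "__main__":
    sampled = [
        Candidate("A", 0.60, 900),
        Candidate("B", 0.95, 300),
        Candidate("A", 0.80, 1200),
        Candidate("A", 0.80, 700),
    ]
    print(select_candidate(sampled))
```

In this sketch the highly confident but isolated answer "B" is filtered out by the agreement check before confidence is consulted, which illustrates why the signals are described as complementary. Whether a shorter trace really indicates lower uncertainty is an assumption of the example, not a claim taken from the abstract.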

Paper Structure (36 sections, 13 figures, 1 table)

Introduction
Methodology
Problem Formulation
SRLM: Self-Reflective Program Search for Long Context
Uncertainty Signals
Sampling-based Uncertainty (Self-Consistency).
Semantic Uncertainty (Verbalized Confidence).
Behavioral Uncertainty (Reasoning Length).
Joint Uncertainty-guided Selection
Experiments
Datasets
Baselines
Experimental Setup
Main Results
Robustness Across Context Lengths
...and 21 more sections

Figures (13)

Figure 1: Overview of SRLM, a framework that augments programmatic context interaction reasoning with uncertainty-aware self-reflection. The language model operates in a self-query execution programming environment where the context is externalized as a variable, and generates programs that query and interact with context. Meanwhile, three complementary uncertainty signals (self-consistency, reasoning trace length, and verbalized confidence) are used to guide self-reflective programming trajectory selection without external supervision, enabling more robust and semantically grounded long-context reasoning.
Figure 2: Performance across context lengths on the OOLONG and LongBench-v2 Full datasets. Line plots show the accuracy of SRLM, RLM, and the base LLM across context lengths from thousands to millions of tokens using GPT-5 (left) and Qwen3-Coder-480B (right) backbones. Bar plots show the average performance gain over the base model, separated into contexts within (${<}131$K) and near/beyond (${\geq}131$K) the native context window.
Figure 3: Accuracy versus cost Pareto comparison of RLM and SRLM (no sub-call) on long-context settings of benchmarks under GPT-5 (left) and Qwen3-Coder-480B (right).
Figure 4: Comparison of SRLM, RLM, and Base LLM across LongBench-v2 domains (averaged across backbone models). In general, SRLM variants show more consistent gains across tasks of differing semantic nature.
Figure 5: Results of ablation experiments across SRLM's variants (averaged across backbones and recursive/nonrecursive runs). Top: Contribution of each uncertainty signal and their combination in SRLM. Bottom: Complementary effects of semantic and behavioral uncertainty as fine-grained signals guiding self-reflection in SRLM.
...and 8 more figures
