Recursive Language Models

Alex L. Zhang, Tim Kraska, Omar Khattab

TL;DR

The paper addresses the bottleneck of fixed context windows in large language models by introducing Recursive Language Models (RLMs), which treat prompts as external environment state and enable the root model to recursively query itself via a persistent REPL. This approach dramatically extends effective prompt length (to 10M+ tokens) and yields strong performance gains on diverse long-context tasks, with costs comparable to or lower than baselines. The authors provide extensive empirical evaluation across multiple benchmarks and frontier models, and they analyze emergent RLM trajectories such as code-based filtering and line-by-line sub-LM transformations. The work suggests a new direction for scaling long-context reasoning and motivates future research into training models to operate as RLMs with asynchronous execution and deeper recursion.

Abstract

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
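The recursive strategy in the abstract can be sketched as a short loop: the long prompt lives as ordinary program state, and the model recurses over snippets small enough to fit in context, then stitches the sub-answers together. The sketch below is illustrative only and makes simplifying assumptions: `call_lm` is a hypothetical stub standing in for a real LLM API, and the prompt is split by fixed line counts, whereas the actual RLM lets the model write its own decomposition code inside a persistent Python REPL.

```python
def call_lm(query: str, context: str) -> str:
    """Hypothetical stand-in for an LLM API call: answers `query`
    against a short `context`. Stubbed here as a substring search
    so the control flow runs end to end."""
    for line in context.splitlines():
        if query in line:
            return line
    return ""

def recursive_lm(query: str, prompt: str, max_lines: int = 64) -> str:
    """Treat `prompt` as external environment state: if it is too long
    for one (sub-)model call, decompose it and recurse over each piece."""
    lines = prompt.splitlines()
    if len(lines) <= max_lines:
        # Base case: the snippet fits in the stub's "context window".
        return call_lm(query, prompt)
    answers = [
        recursive_lm(query, "\n".join(lines[i : i + max_lines]))
        for i in range(0, len(lines), max_lines)
    ]
    # Root call stitches non-empty sub-answers into a final response.
    return call_lm(query, "\n".join(a for a in answers if a))

# A "long" prompt far beyond the stub's 64-line budget:
long_prompt = "\n".join(f"record {i}: value {i * 7}" for i in range(500))
print(recursive_lm("record 321", long_prompt))  # → record 321: value 2247
```

The key design point, per the abstract, is that the prompt never has to fit in any single model call: only bounded snippets and stitched sub-answers ever enter a context window.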

Paper Structure

This paper contains 22 sections, 10 figures, and 1 table.

Figures (10)

  • Figure 1: A comparison of GPT-5 and a corresponding RLM on three long-context tasks of increasing complexity: S-NIAH, OOLONG, and OOLONG-Pairs. For each task, we scale the input length from $2^{13}$ to $2^{18}$. GPT-5 performance degrades significantly as a function of both input length and task complexity, while the RLM maintains strong performance. Inputs beyond the red region do not fit in GPT-5's context window of 272K tokens, but the RLM handles them effectively. Additional experiments across other models, methods, and benchmarks are in §4.
  • Figure 2: A Recursive Language Model (RLM) treats prompts as part of the environment. It loads the input prompt as a variable inside a Python REPL environment $\mathcal{E}$ and writes code to peek into, decompose, and invoke itself recursively over programmatic snippets of the variable.
  • Figure 3: Cost of RLM and baselines described in §4.2, plotted at the 25th, 50th, 75th, and 95th percentiles of total API cost. We observe comparable or even lower costs for RLMs at the 50th percentile, but sharp increases at the tail end due to potentially long RLM trajectories.
  • Figure 4: RLMs have common patterns in their trajectories when solving tasks. (a) We frequently observed RLMs filtering and interacting with their context through code such as regex queries. (b) We found that RLMs can effectively decompose their context through recursive sub-calls. (c) On long-output tasks, RLMs are able to solve sub-problems using recursive sub-LM calls and stitch their outputs to form a final output.
  • Figure 5: Runtime of GPT-5 across OOLONG, OOLONG-Pairs, CodeQA, and BrowseComp+ (1K) for all methods described in §4.2, plotted at the 25th, 50th, 75th, and 95th percentiles.
  • ...and 5 more figures
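The code-based filtering pattern from Figure 4(a) can be illustrated with a small, hypothetical sketch: inside the REPL, the model narrows a large prompt variable with a regex query before spending any sub-LM calls on it. All variable names and the toy log data below are illustrative assumptions, not taken from the paper.

```python
import re

# Toy stand-in for a long prompt loaded into the REPL as a variable:
# 28 log lines, a few of which contain the signal we care about.
prompt = "\n".join(
    f"2024-01-{d:02d} INFO heartbeat ok" if d % 7 else f"2024-01-{d:02d} ERROR disk full"
    for d in range(1, 29)
)

# Step 1: peek at the environment variable cheaply, without "reading" it.
print(f"{len(prompt.splitlines())} lines loaded")

# Step 2: filter with code instead of pulling everything into context;
# only the matching lines would then be passed to a recursive sub-LM call.
errors = [ln for ln in prompt.splitlines() if re.search(r"\bERROR\b", ln)]
print(errors)
```

The point of the pattern is economy: deterministic code handles the bulk of the prompt, and the (expensive) recursive LM calls only ever see the filtered residue.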