Think, But Don't Overthink: Reproducing Recursive Language Models

Daren Wang

TL;DR

This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework and investigates the impact of scaling the recursion depth.

Abstract

This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). The framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. While the original paper relies on a default recursion depth of 1 and suggests deeper recursion as a future direction, this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated a pure LLM baseline, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time (e.g., from 3.6s to 344.5s) and token costs. Code and data are available at: https://github.com/drbillwang/rlm-reproduction
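To make the role of recursion depth concrete, the following is a minimal sketch of depth-limited recursive calling. It is not the implementation from the repository or from Zhang et al.: in an actual RLM the model itself writes REPL code to inspect the context and decide when to call sub-models, whereas this sketch assumes a fixed, hypothetical split into `fanout` pieces at every level. The names `split_even`, `recursive_answer`, `fanout`, and `CountingStub` are illustrative assumptions, and the stub stands in for a real model call. The point is only to show how the number of model calls grows as depth increases.

```python
from typing import Callable, List


def split_even(text: str, parts: int) -> List[str]:
    """Split text into at most `parts` contiguous, non-empty pieces."""
    size = max(1, -(-len(text) // parts))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_answer(
    query: str,
    context: str,
    llm: Callable[[str], str],
    depth: int,
    fanout: int = 4,  # hypothetical fixed branching factor (an assumption)
) -> str:
    """Answer `query` over `context` with at most `depth` levels of sub-calls."""
    if depth == 0:
        # Base case: a plain LLM call on whatever context is left.
        return llm(f"{query}\n\n{context}")
    # Recursive case: delegate each piece to a sub-call one level shallower,
    # then ask the model to combine the partial answers.
    pieces = split_even(context, fanout)
    partials = [
        recursive_answer(query, piece, llm, depth - 1, fanout)
        for piece in pieces
    ]
    return llm(f"{query}\n\n" + "\n".join(partials))


class CountingStub:
    """Stand-in for a real model; it only counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"summary of {len(prompt)} chars"


if __name__ == "__main__":
    context = "x" * 1600
    for depth in (0, 1, 2):
        stub = CountingStub()
        recursive_answer("Where is the needle?", context, stub, depth)
        print(f"depth={depth}: {stub.calls} model calls")
```

With a fan-out of 4 and a 1,600-character context, the stub is called 1, 5, and 21 times at depths 0, 1, and 2. Under this simplified assumption the call count grows geometrically with depth, which is one plausible contributor to the time and token inflation reported above; the measured figures in this study come from the real RLM runs, not from this sketch.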

Paper Structure (19 sections, 5 figures)

Introduction
Setup Notes
Environment and Core Libraries
Data
API Keys and Configuration
Compute
Reproduction Targets & Metric Definition
Results and Analysis
Paradoxical Degradation on Simple Retrieval ($O(1)$ Tasks)
The "Overthinking" Effect on Complex Reasoning ($O(N)$ Tasks)
Barriers to Industrial Deployment: Time, Tokens, and Cost
Qualitative Analysis: How Deep Recursion Breaks Models
Limitation: Single-Run Results
Conclusions and Future Directions
Debug Diary
...and 4 more sections

Figures (5)

Figure 1: Performance comparison of Base LLM, RLM (Depth=1), and RLM (Depth=2) against the original paper's benchmarks.
Figure 2: Average Execution Time (seconds) across different models and recursion depths.
Figure 3: Average Token Usage (thousands) across different models and recursion depths.
Figure 4: Average Token Cost (US$ cents) across different models and recursion depths.
Figure 5: Qualitative examples of RLM trajectory failures. Deeper recursion (Depth=2) often induces parametric hallucinations, role-playing confusion within the REPL, and performative over-explanation.
