Think, But Don't Overthink: Reproducing Recursive Language Models

Daren Wang

TL;DR

This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework and investigates the impact of scaling the recursion depth.

Abstract

This project reproduces and extends the recently proposed "Recursive Language Models" (RLMs) framework by Zhang et al. (2026). The framework enables Large Language Models (LLMs) to process near-infinite contexts by offloading the prompt into an external REPL environment. The original paper uses a default recursion depth of 1 and leaves deeper recursion as a future direction; this study specifically investigates the impact of scaling the recursion depth. Using state-of-the-art open-source agentic models (DeepSeek v3.2 and Kimi K2), I evaluated a plain LLM baseline, RLM (depth=1), and RLM (depth=2) on the S-NIAH and OOLONG benchmarks. The findings reveal a compelling phenomenon: deeper recursion causes models to "overthink". While depth-1 RLMs effectively boost accuracy on complex reasoning tasks, applying deeper recursion (depth=2) or using RLMs on simple retrieval tasks paradoxically degrades performance while sharply inflating token costs and execution time (e.g., from 3.6s to 344.5s, roughly a 96-fold increase). Code and data are available at: https://github.com/drbillwang/rlm-reproduction
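To make the notion of recursion depth concrete, the following is a minimal sketch and not the implementation from the original paper or from the repository above. It assumes a fixed fan-out split of the context, whereas an actual RLM lets the model write its own REPL code to inspect the context and decide when to recurse. The names `rlm_answer`, `split_even`, `CountingStub`, and `count_calls`, and the `fanout` parameter, are hypothetical and exist only for illustration. The stub stands in for a real model so that the number of model calls at each depth can be counted.

```python
from typing import Callable, List


def split_even(text: str, parts: int) -> List[str]:
    """Split text into at most `parts` contiguous pieces of near-equal length."""
    step = max(1, -(-len(text) // parts))  # ceiling division, never zero
    return [text[i:i + step] for i in range(0, len(text), step)] or [""]


def rlm_answer(query: str, context: str, depth: int,
               llm: Callable[[str], str], fanout: int = 4) -> str:
    """Answer `query` over `context` with a fixed recursion depth.

    depth == 0 is a plain LLM call on the whole context. At depth > 0 the
    context is split (an assumption of this sketch), each piece is answered
    one level shallower, and one extra call aggregates the partial answers.
    """
    if depth == 0:
        return llm(f"{query}\n\n{context}")
    partials = [
        rlm_answer(query, piece, depth - 1, llm, fanout)
        for piece in split_even(context, fanout)
    ]
    return llm(f"{query}\n\nPartial answers:\n" + "\n".join(partials))


class CountingStub:
    """Hypothetical stand-in for a model: counts calls, returns a dummy string."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"stub-{self.calls}"


def count_calls(depth: int, context_len: int = 4000, fanout: int = 4) -> int:
    """Number of model calls the sketch makes for a given depth."""
    stub = CountingStub()
    rlm_answer("Where is the needle?", "x" * context_len, depth, stub, fanout)
    return stub.calls


if __name__ == "__main__":
    for d in (0, 1, 2):
        print(d, count_calls(d))  # 1, 5, and 21 calls respectively
```

Under these assumptions the call count follows C(0) = 1 and C(d) = fanout * C(d-1) + 1, so each additional level multiplies the work by roughly the fan-out. This only illustrates why cost can grow quickly with depth; it does not model the accuracy effects reported in the abstract.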

Paper Structure

This paper contains 19 sections and 5 figures.

Figures (5)

  • Figure 1: Performance comparison of Base LLM, RLM (Depth=1), and RLM (Depth=2) against the original paper's benchmarks.
  • Figure 2: Average Execution Time (seconds) across different models and recursion depths.
  • Figure 3: Average Token Usage (thousands) across different models and recursion depths.
  • Figure 4: Average Token Cost (US$ cents) across different models and recursion depths.
  • Figure 5: Qualitative examples of RLM trajectory failures. Deeper recursion (Depth=2) often induces parametric hallucinations, role-playing confusion within the REPL, and performative over-explanation.