FoE: Forest of Errors Makes the First Solution the Best in Large Reasoning Models

Kehan Jiang, Haonan Dong, Zhaolu Kang, Zhengzhou Zhu, Guojie Song

Abstract

Recent Large Reasoning Models (LRMs) like DeepSeek-R1 have demonstrated remarkable success in complex reasoning tasks, exhibiting human-like patterns in exploring multiple alternative solutions. Upon closer inspection, however, we uncover a surprising phenomenon: The First is The Best, where alternative solutions are not merely suboptimal but potentially detrimental. This observation challenges widely accepted test-time scaling laws, leading us to hypothesize that errors within the reasoning path scale concurrently with test time. Through comprehensive empirical analysis, we characterize errors as a forest-structured Forest of Errors (FoE) and conclude that FoE makes the First the Best, a finding underpinned by rigorous theoretical analysis. Leveraging these insights, we propose RED, a self-guided efficient reasoning framework comprising two components: I) Refining First, which suppresses FoE growth in the first solution; and II) Discarding Subs, which prunes subsequent FoE via dual-consistency. Extensive experiments across five benchmarks and six backbone models demonstrate that RED outperforms eight competitive baselines, achieving performance gains of up to 19.0% while reducing token consumption by 37.7% to 70.4%. Moreover, comparative experiments on FoE metrics shed light on how RED achieves its effectiveness.

Paper Structure

This paper contains 136 sections, 67 equations, 15 figures, 23 tables.

Figures (15)

  • Figure 1: (Upper) The First is The Best. (Lower) Forest of Errors.
  • Figure 2: (Left) Key observations within the FoE. (Right) Our proposed RED framework.
  • Figure 3: Manual correction on distinct error node types. (Upper) Impact of rectifying individual Grandchild, Child, and Root nodes. (Lower) Consequences of delayed root correction in a formed tree, demonstrating substantial mitigation of subsequent node proliferation.
  • Figure 4: Distribution of various node types with respect to entropy and entropy variance. Experiments were conducted using the Qwen3-8B model on BS-17K-subset.
  • Figure 5: Average distribution of correction types (True, Refuse, Fake) on BS-17K-subset + Qwen3-8B. We manually inject error-signaling prompts to probe early-stage errors ($<20\%$). The green and red vertical lines mark the completion of First and Subs, respectively.
  • ...and 10 more figures