ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee

TL;DR

ReJump introduces a two-layer tree-jump representation to analyze LLM reasoning traces, enabling quantitative study of exploration, verification, and forgetting beyond final accuracy. It provides an end-to-end pipeline with ReJump-Extractor to convert CoTs into structured trees and jumps, plus six behavioral metrics and two similarity measures for cross-model comparison. The work demonstrates that models with similar accuracy can exhibit substantially different reasoning styles across tasks, and shows that test-time strategies guided by ReJump, such as Best-of-N selection and prompt selection, can improve reasoning outcomes. Overall, ReJump offers both a diagnostic tool for understanding LLM reasoning and practical ways to improve reasoning quality at test time.

Abstract

Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at https://github.com/UW-Madison-Lee-Lab/ReJump.
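To make the tree-plus-jumps idea concrete, here is a minimal sketch of a reasoning trace stored as a visitation order over tree nodes, with each transition labeled as adjacent or non-adjacent. The toy tree, the node ids, and the helper names `is_adjacent` and `classify_jumps` are hypothetical and are not taken from the ReJump codebase; treating two nodes as adjacent when they share a tree edge is also an assumption made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy tree stored as child -> parent (the root maps to None).
#        0
#       / \
#      1   2
#      |   |
#      3   5
PARENT: Dict[int, Optional[int]] = {0: None, 1: 0, 2: 0, 3: 1, 5: 2}


def is_adjacent(u: int, v: int, parent: Dict[int, Optional[int]]) -> bool:
    """Assumption: a move is 'adjacent' if u and v share a tree edge."""
    return parent.get(u) == v or parent.get(v) == u


def classify_jumps(
    trace: List[int], parent: Dict[int, Optional[int]]
) -> List[Tuple[int, int, str]]:
    """Label every consecutive pair of visited nodes in the trace."""
    return [
        (u, v, "adjacent" if is_adjacent(u, v, parent) else "non-adjacent")
        for u, v in zip(trace, trace[1:])
    ]


# A trace that works down one branch, jumps back to the root, tries the
# other branch, and finally returns to the first leaf.
trace = [0, 1, 3, 0, 2, 5, 3]
for u, v, kind in classify_jumps(trace, PARENT):
    print(f"{u} -> {v}: {kind}")
```

In this toy trace, the moves `3 -> 0` and `5 -> 3` are the non-adjacent ones; under the abstract's description, such jumps are where behaviors like backtracking and verification would show up.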

Paper Structure

This paper contains 59 sections, 2 equations, 19 figures, 13 tables.

Figures (19)

  • Figure 1: ReJump representations of reasoning traces generated by Claude 3.7 Sonnet, Grok 3 Mini Beta, and DeepSeek-R1 on a Game of 24 problem. All three models arrive at the same final answer, but their reasoning behaviors differ. Here, both Claude 3.7 Sonnet and Grok 3 Mini Beta follow a single linear reasoning path; however, Claude 3.7 Sonnet adopts the answer without verification, while Grok 3 Mini Beta verifies it before concluding. In contrast, DeepSeek-R1 explores multiple solution paths, exhibiting more deliberate behaviors such as backtracking and verification.
  • Figure 2: Illustration of how $d_{\text{jump}}$ quantifies the exploration-exploitation trade-off in model reasoning. Given a sequence of visited leaf nodes $(v_1, v_2, v_3)$, the left panel depicts a trace exhibiting local exploration (shorter paths between nodes), while the right panel shows a trace with larger jumps to distant leaves, reflecting more global exploration.
  • Figure 3: Illustration of how ReJump-Extractor converts a reasoning trace into the ReJump representation for a math word problem. This example is crafted for demonstration purposes. Nodes represent partial solutions, and tree edges indicate that the parent nodes serve as prerequisites for the child nodes. Dashed arrows show how the reasoning moves between nodes; each transition corresponds to an action type (calc, verify, or backtrack) and is color-coded accordingly.
  • Figure 4: ReJump representations extracted by ReJump-Extractor for reasoning traces generated by DeepSeek-R1, Phi-4-reasoning-plus, and Claude 3.7 Sonnet for a Game of 24 problem.
  • Figure 5: Reasoning performance of DeepSeek-R1, Grok 3 Mini Beta, QwQ-32B, Phi-4-reasoning-plus, and Claude 3.7 Sonnet on MATH-500 and Game of 24. The bar plots present the final accuracy (pass@1), while the radar plots detail six reasoning metrics. For comparability, solution count and jump distance are normalized across all models and datasets. To ensure that higher values consistently reflect preferred behavior, we report the non-forgetting rate and non-overthinking rate rather than forgetting rate and overthinking rate. The results show that models display distinct reasoning behaviors across datasets. Furthermore, even when models achieve similar final performance, their underlying reasoning processes can differ significantly. To better highlight metric differences among the strongest models DeepSeek-R1, Grok 3 Mini Beta, and Claude 3.7 Sonnet, Figure \ref{fig:benchmark_top3} focuses exclusively on these three.
  • ...and 14 more figures
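
The local-versus-global contrast in the Figure 2 caption can be sketched numerically. The snippet below assumes, for illustration only, that $d_{\text{jump}}$ is the mean tree-path length between consecutively visited leaves; the paper's exact definition and normalization are not given in the text above. The toy tree and the function names `path_to_root`, `tree_distance`, and `mean_jump_distance` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical toy tree stored as child -> parent (the root maps to None).
# Root 0 has children 1 and 2; leaves 3, 4 sit under 1 and leaves 5, 6 under 2.
PARENT: Dict[int, Optional[int]] = {0: None, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}


def path_to_root(node: int, parent: Dict[int, Optional[int]]) -> List[int]:
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(u: int, v: int, parent: Dict[int, Optional[int]]) -> int:
    """Number of edges on the tree path between u and v."""
    steps_from_u = {n: i for i, n in enumerate(path_to_root(u, parent))}
    for j, n in enumerate(path_to_root(v, parent)):
        if n in steps_from_u:  # first shared ancestor is the lowest one
            return steps_from_u[n] + j
    raise ValueError("nodes are not in the same tree")


def mean_jump_distance(leaves: List[int], parent: Dict[int, Optional[int]]) -> float:
    """Assumed stand-in for d_jump: mean distance between consecutive leaves."""
    pairs = list(zip(leaves, leaves[1:]))
    if not pairs:
        return 0.0
    return sum(tree_distance(u, v, parent) for u, v in pairs) / len(pairs)


local_trace = [3, 4, 3]   # stays within one subtree (local exploration)
global_trace = [3, 5, 4]  # hops across subtrees (more global exploration)
print(mean_jump_distance(local_trace, PARENT))
print(mean_jump_distance(global_trace, PARENT))
```

Under this assumed definition, the trace confined to one subtree yields a smaller value than the trace that crosses between subtrees, matching the left and right panels described in the caption.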