ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning
Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee
TL;DR
ReJump introduces a two-layer tree-jump representation for analyzing LLM reasoning traces, enabling quantitative study of exploration, verification, and forgetting beyond final accuracy. It provides an end-to-end pipeline in which ReJump-Extractor converts CoTs into structured trees and jumps, along with six behavioral metrics and two similarity measures for cross-model comparison. The work demonstrates that models with similar accuracy can exhibit substantially different reasoning styles across tasks, and shows that test-time strategies guided by ReJump, such as Best-of-N selection and prompt selection, can improve reasoning outcomes. Overall, ReJump offers both a diagnostic tool for understanding LLM reasoning and practical avenues for improving reasoning quality at test time.
Abstract
Large Reasoning Models (LRMs) are Large Language Models (LLMs) explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at https://github.com/UW-Madison-Lee-Lab/ReJump.
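To make the representation concrete, the following is a minimal sketch of how a reasoning trace could be stored as a tree plus a visitation order, with each transition labeled as an adjacent or non-adjacent jump. It is an illustration only and not the authors' implementation: the names `ReasoningTree`, `is_adjacent`, and `classify_jumps` are hypothetical, and the sketch assumes that "adjacent" means a direct parent-child edge in either direction, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ReasoningTree:
    """Hypothetical container: maps each node id to its parent id (root -> None)."""
    parent: Dict[int, Optional[int]] = field(default_factory=dict)

    def add_node(self, node: int, parent: Optional[int] = None) -> None:
        self.parent[node] = parent

    def is_adjacent(self, a: int, b: int) -> bool:
        # Assumption: two nodes are adjacent if one is the direct parent of the other.
        return self.parent.get(a) == b or self.parent.get(b) == a


def classify_jumps(tree: ReasoningTree, visit_order: List[int]) -> List[Tuple[int, int, str]]:
    """Label every consecutive pair of visited nodes as an adjacent or non-adjacent jump."""
    jumps = []
    for src, dst in zip(visit_order, visit_order[1:]):
        kind = "adjacent" if tree.is_adjacent(src, dst) else "non_adjacent"
        jumps.append((src, dst, kind))
    return jumps


# Toy tree: node 0 is the problem statement, 1 and 2 are alternative
# approaches, and 3 is an intermediate step under approach 1.
tree = ReasoningTree()
tree.add_node(0)
tree.add_node(1, parent=0)
tree.add_node(2, parent=0)
tree.add_node(3, parent=1)

# The trace explores approach 1, then abandons it for approach 2.
for src, dst, kind in classify_jumps(tree, [0, 1, 3, 2]):
    print(f"{src} -> {dst}: {kind}")
```

In this toy trace, the move from node 3 to node 2 is the only non-adjacent jump, corresponding to abandoning one branch for another. Metrics of the kind the abstract describes could then be computed over such labeled jumps, though the exact definitions are given in the paper rather than here.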
