Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng

TL;DR

The paper delivers a unified ME$^2$ framework for characterizing, evaluating, and optimizing complex reasoning in Large Reasoning Models. It models reasoning traces as DAGs and evaluates them via macro- and micro-level abstractions along efficiency and effectiveness, using pairwise preferences to train a Thinking Reward Model (TRM) with a Bradley–Terry objective. A TRM-Preference dataset enables scalable, structure-aware evaluation, and TRM-guided test-time scaling plus TRM-assisted RL optimization yield substantial gains in reasoning quality and task performance. Across diverse tasks, the approach demonstrates that high-quality reasoning acts as a reliable optimization signal, improving outcomes by up to 19.3% at test time and up to 3.9% during training, offering a practical pathway to more reliable, efficient LRM reasoning.
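The Bradley–Terry objective mentioned above can be illustrated with a minimal sketch. It assumes the reward model emits one scalar score per reasoning trace and that each training pair has a preferred ("chosen") and a dispreferred ("rejected") trace; the function names and the toy scores are hypothetical and are not taken from the paper.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen trace beats the rejected one.

    Under the Bradley-Terry model, P(chosen > rejected) = sigmoid(s_c - s_r),
    so the loss is -log(sigmoid(s_c - s_r)) = log(1 + exp(-(s_c - s_r))).
    """
    margin = score_chosen - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs):
    """Average the loss over (score_chosen, score_rejected) pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)


# Toy scores (illustrative only): the first pair is ranked correctly,
# the second is ranked incorrectly and therefore contributes a larger loss.
toy_pairs = [(2.0, 0.0), (0.0, 2.0)]
print(mean_pairwise_loss(toy_pairs))
```

When both traces receive the same score the loss equals log 2, and it shrinks toward zero as the chosen trace's score pulls ahead, which is what pushes a reward model to separate preferred from dispreferred reasoning.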

Abstract

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels in terms of efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
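Test-time selection of better reasoning can be sketched as Best-of-$N$: sample several candidate traces and keep the one the reward model scores highest. The sketch below is an illustration under stated assumptions, not the paper's implementation; `best_of_n` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for a real scorer such as a trained TRM.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score_fn: Callable[[str], float]) -> str:
    """Return the candidate with the highest score under score_fn.

    Ties are resolved in favor of the earliest candidate, which is the
    behavior of Python's built-in max().
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score_fn)


def toy_score(trace: str) -> float:
    """Hypothetical stand-in scorer: prefers shorter traces.

    A real setup would call a reward model here instead.
    """
    return -float(len(trace))


traces = [
    "Try x = 2, fail, try x = 3, fail, try x = 4, so the answer is 4.",
    "Solve 2x = 8 directly, so x = 4.",
    "Guess x = 4, verify 2 * 4 = 8, so the answer is 4.",
]
print(best_of_n(traces, toy_score))
```

Passing the scorer as an argument keeps the selection logic independent of any particular reward model, so different scorers can be compared on the same sampled candidates.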

Paper Structure

This paper contains 83 sections, 6 theorems, 55 equations, 18 figures, 4 tables, 1 algorithm.

Key Result

Theorem 5.1

Consider the RLVR objective with a binary verifiable reward $r_v\in\{0,1\}$ and the gated reward $r$ defined in Eq. \ref{eq:reward}. Optimizing Eq. \ref{eq:reward} preserves the set of optimal policies of the original RLVR objective defined solely by $r_v$.

Figures (18)

  • Figure 1: Overview of our framework. (a) ME$^2$ principle for characterizing reasoning quality (Sec. \ref{sec:Q1}). (b) DAG-based reasoning abstraction and pairwise evaluation (Sec. \ref{sec:Q2}). (c) TRM training, test-time scaling, and RL optimization (Sec. \ref{sec:Q3}).
  • Figure 2: The ME$^2$ principle, characterizing reasoning trace quality along macro/micro granularity and efficiency/effectiveness.
  • Figure 3: Macro- and micro-level abstractions over a reasoning DAG with three canonical structures (progression, branching, and merging). Edges are directed from top to bottom, and node indices indicate step order. Progression: linear continuation. Branching: a node expands into multiple child nodes. Merging: multiple parent nodes converge into a single child node.
  • Figure 4: Best-of-$N$ test-time scaling results on AIME24 and AIME25, comparing Qwen2.5-Math-PRM-7B, ReasonFlux-PRM-7B, and our TRM. The top and bottom rows correspond to using Qwen3-8B and GPT-OSS-20B as the response models, respectively.
  • Figure 5: Pairwise win rates (ties excluded) across different policy models. TRM, Verifier, and ReasonFlux denote policies trained with our TRM, the verifiable reward $r_v$, and ReasonFlux-PRM-7B, respectively. Each bar reports the win rate of Model A against Model B.
  • ...and 13 more figures
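
The three canonical structures named in the Figure 3 caption (progression, branching, merging) can be read off a reasoning DAG from node degrees. The following is a minimal sketch under a simplifying assumption: a node with more than one child is tagged as branching, a node with more than one parent as merging, and every other node as progression. The edge list and the function name `classify_nodes` are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def classify_nodes(edges):
    """Tag each node of a DAG by its local structure.

    edges: iterable of (parent, child) pairs, directed from parent to child.
    Returns a dict mapping node -> list of tags.
    """
    parents = defaultdict(set)
    children = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
        nodes.update((src, dst))

    labels = {}
    for node in sorted(nodes):
        tags = []
        if len(children[node]) > 1:
            tags.append("branching")  # expands into multiple child nodes
        if len(parents[node]) > 1:
            tags.append("merging")  # multiple parents converge here
        if not tags:
            tags.append("progression")  # simplification: everything else
        labels[node] = tags
    return labels


# Toy trace: step 1 -> step 2, which branches into steps 3 and 4,
# and both branches merge into step 5.
toy_edges = [(1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
for node, tags in classify_nodes(toy_edges).items():
    print(node, tags)
```

A node can carry both tags at once if it merges several parents and then branches again, which is why the labels are stored as lists rather than single values.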

Theorems & Definitions (10)

  • Theorem 5.1: Optimal Policy Invariance of Eq. \ref{eq:reward}
  • Theorem 5.2: Lower Bound on Policy Improvement
  • Theorem 1: Optimal Policy Invariance
  • proof
  • Theorem 2: Lower Bound on Policy Improvement
  • Lemma 3: Policy Difference
  • proof
  • Lemma 4: Natural policy gradient update
  • proof
  • proof