Characterizing, Evaluating, and Optimizing Complex Reasoning

Haoran Zhang; Yafu Li; Zhi Wang; Zhilin Wang; Shunkai Zhang; Xiaoye Qu; Yu Cheng

TL;DR

The paper delivers a unified ME$^2$ framework for characterizing, evaluating, and optimizing complex reasoning in Large Reasoning Models. It models reasoning traces as DAGs and evaluates them via macro- and micro-level abstractions along efficiency and effectiveness, using pairwise preferences to train a Thinking Reward Model (TRM) with a Bradley–Terry objective. A TRM-Preference dataset enables scalable, structure-aware evaluation, and TRM-guided test-time scaling plus TRM-assisted RL optimization yield substantial gains in reasoning quality and task performance. Across diverse tasks, the approach demonstrates that high-quality reasoning acts as a reliable optimization signal, improving outcomes by up to 19.3% at test time and up to 3.9% during training, offering a practical pathway to more reliable, efficient LRM reasoning.
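To make the Bradley–Terry objective mentioned above concrete, here is a minimal sketch of the pairwise loss for one preference pair. It assumes the reward model returns a single scalar score per reasoning trace; the function and argument names are illustrative and are not taken from the paper.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred trace beats the other one.

    Under the Bradley-Terry model,
    P(chosen beats rejected) = sigmoid(score_chosen - score_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch on the sign of m so that
    # exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred trace's score rises above the other trace's score, and equals $\log 2$ when the two scores are tied.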

Abstract

Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle to characterize reasoning quality at the macro and micro levels in terms of efficiency and effectiveness. (2) Built on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
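The test-time use of thinking rewards described above amounts to Best-of-$N$ selection: sample several candidate responses and keep the one the reward model scores highest. The sketch below assumes a scoring callable that maps a response to a scalar. The `score` argument is a hypothetical stand-in for a trained reward model, not the paper's TRM interface.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score.

    `score` is a placeholder for any scalar scorer (assumption: higher is better).
    """
    if not candidates:
        raise ValueError("best_of_n needs at least one candidate")
    # max() keeps the first candidate among ties, so selection is deterministic.
    return max(candidates, key=score)
```

In practice the scorer would wrap a forward pass of the reward model. Any callable works for illustration, for example `best_of_n(samples, lambda text: -len(text))` picks the shortest sample.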

Paper Structure (83 sections, 6 theorems, 55 equations, 18 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Characterizing High-quality Reasoning.
Reasoning Structure Modeling.
Reward Modeling.
Characterizing Reasoning Trace Quality
Overview
ME$^2$ principle
Macro-Efficiency.
Macro-Effectiveness.
Micro-Efficiency.
Micro-Effectiveness.
Evaluating Reasoning Traces
Step Partitioning
Reasoning Structuring
...and 68 more sections

Key Result

Theorem 5.1

Consider the RLVR objective with a binary verifiable reward $r_v\in\{0,1\}$ and the gated reward $r$ defined in Eq. \ref{eq:reward}. Optimizing Eq. \ref{eq:reward} preserves the set of optimal policies of the original RLVR objective defined solely by $r_v$.

Figures (18)

Figure 1: Overview of our framework. (a) ME$^2$ principle for characterizing reasoning quality (Sec. \ref{sec:Q1}). (b) DAG-based reasoning abstraction and pairwise evaluation (Sec. \ref{sec:Q2}). (c) TRM training, test-time scaling, and RL optimization (Sec. \ref{sec:Q3}).
Figure 2: The ME$^2$ principle, characterizing reasoning trace quality along macro/micro granularity and efficiency/effectiveness.
Figure 3: Macro- and micro-level abstractions over a reasoning DAG with three canonical structures (progression, branching, and merging). Edges are directed from top to bottom, and node indices indicate step order. Progression: linear continuation. Branching: a node expands into multiple child nodes. Merging: multiple parent nodes converge into a single child node.
Figure 4: Best-of-$N$ test-time scaling results on AIME24 and AIME25, comparing Qwen2.5-Math-PRM-7B, ReasonFlux-PRM-7B, and our TRM. The top and bottom rows correspond to using Qwen3-8B and GPT-OSS-20B as the response models, respectively.
Figure 5: Pairwise win rates (ties excluded) across different policy models. TRM, Verifier, and ReasonFlux denote policies trained with our TRM, the verifiable reward $r_v$, and ReasonFlux-PRM-7B, respectively. Each bar reports the win rate of Model A against Model B.
...and 13 more figures
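The three canonical structures named in the Figure 3 caption (progression, branching, and merging) can be read off a reasoning DAG from node degrees. The sketch below is one plausible reading, not the paper's algorithm: it assumes the DAG is given as a list of (parent, child) step indices, and it treats a node with more than one child as a branching point and a node with more than one parent as a merging point.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def classify_nodes(edges: Iterable[Tuple[int, int]]) -> Dict[int, List[str]]:
    """Tag each node of a reasoning DAG by its local structure.

    Assumed convention (hypothetical): out-degree > 1 means "branching",
    in-degree > 1 means "merging", anything else is "progression".
    """
    edge_list = list(edges)
    out_degree = Counter(parent for parent, _ in edge_list)
    in_degree = Counter(child for _, child in edge_list)
    nodes = set(out_degree) | set(in_degree)

    labels: Dict[int, List[str]] = {}
    for node in sorted(nodes):
        tags = []
        if out_degree[node] > 1:
            tags.append("branching")
        if in_degree[node] > 1:
            tags.append("merging")
        # A node that neither splits nor joins simply continues the chain.
        labels[node] = tags or ["progression"]
    return labels


# Step 1 splits into steps 2 and 3, which rejoin at step 4 before step 5.
example = classify_nodes([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)])
```

In this example, step 1 is tagged as branching, step 4 as merging, and steps 2, 3, and 5 as progression.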

Theorems & Definitions (10)

Theorem 5.1: Optimal Policy Invariance of Eq. \ref{eq:reward}
Theorem 5.2: Lower Bound on Policy Improvement
Theorem 1: Optimal Policy Invariance
proof
Theorem 2: Lower Bound on Policy Improvement
Lemma 3: Policy Difference
proof
Lemma 4: Natural policy gradient update
proof
proof
