Characterizing, Evaluating, and Optimizing Complex Reasoning
Haoran Zhang, Yafu Li, Zhi Wang, Zhilin Wang, Shunkai Zhang, Xiaoye Qu, Yu Cheng
TL;DR
The paper presents a unified framework, built on the ME$^2$ principle, for characterizing, evaluating, and optimizing complex reasoning in Large Reasoning Models (LRMs). It models reasoning traces as directed acyclic graphs (DAGs) and judges them at the macro and micro levels for both efficiency and effectiveness; the resulting pairwise preferences are used to train a Thinking Reward Model (TRM) with a Bradley–Terry objective. The TRM-Preference dataset enables scalable, structure-aware evaluation, and both TRM-guided test-time scaling and TRM-assisted RL optimization improve reasoning quality and task performance. Across diverse tasks, reasoning quality acts as a reliable optimization signal, improving outcomes by up to 19.3% at test time and up to 3.9% during RL training, which offers a practical path to more reliable and efficient LRM reasoning.
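For readers unfamiliar with the Bradley–Terry objective mentioned above, the following is a minimal sketch of the standard pairwise loss, assuming the reward model assigns a single scalar score to each reasoning trace. The function name `bradley_terry_loss` and its arguments are hypothetical and are not taken from the paper, whose exact training objective may differ in detail.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred trace wins under Bradley-Terry.

    Under the Bradley-Terry model, P(preferred beats rejected) is
    sigmoid(score_preferred - score_rejected). The loss is -log of that
    probability. Both scores are assumed to be scalar reward-model outputs.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch on the sign of
    # the margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks toward zero as the preferred trace's score exceeds the rejected one by a growing margin, and it equals log 2 when the two scores are tied.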
Abstract
Large Reasoning Models (LRMs) increasingly rely on reasoning traces with complex internal structures. However, existing work lacks a unified answer to three fundamental questions: (1) what defines high-quality reasoning, (2) how to reliably evaluate long, implicitly structured reasoning traces, and (3) how to use such evaluation signals for reasoning optimization. To address these challenges, we provide a unified perspective. (1) We introduce the ME$^2$ principle, which characterizes reasoning quality at the macro and micro levels in terms of efficiency and effectiveness. (2) Building on this principle, we model reasoning traces as directed acyclic graphs (DAGs) and develop a DAG-based pairwise evaluation method that captures complex reasoning structures. (3) Based on this method, we construct the TRM-Preference dataset and train a Thinking Reward Model (TRM) to evaluate reasoning quality at scale. Experiments show that thinking rewards serve as an effective optimization signal. At test time, selecting better reasoning leads to better outcomes (up to 19.3% gain), and during RL training, thinking rewards enhance reasoning and performance (up to 3.9% gain) across diverse tasks.
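As an illustration of how a thinking reward could drive test-time selection, the sketch below picks the highest-scoring trace among several candidates (a best-of-N scheme). This is an assumption about the general shape of such a procedure, not the paper's implementation: `select_best_trace` and `score_fn` are hypothetical names, and `score_fn` stands in for a trained TRM.

```python
from typing import Callable, Sequence


def select_best_trace(
    traces: Sequence[str],
    score_fn: Callable[[str], float],
) -> str:
    """Return the candidate reasoning trace with the highest reward score.

    `score_fn` is a placeholder for a reward model that maps one trace to a
    scalar; any callable with that shape works here.
    """
    if not traces:
        raise ValueError("at least one candidate trace is required")
    # max() keeps the first candidate when several share the top score.
    return max(traces, key=score_fn)
```

In use, `traces` would hold N sampled responses to the same prompt, and the answer would be read from the selected trace.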
