CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs

Siyi Li, Jiajun Shi, Shiwen Ni, Ge Zhang, Shuaimin Li, Shijian Wang, Zhoufutu Wen, Yizhi Li, Hamid Alinejad-Rokny, Jiaheng Liu, Min Yang, Wenhao Huang

TL;DR

CoTJudger is a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution.

Abstract

Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal -- how much of a CoT is necessary versus structurally redundant -- that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
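
To make the graph idea concrete, the following is a minimal sketch of extracting a shortest path from a toy dependency graph and deriving a step-level redundancy ratio. It is not the authors' implementation: the function names `shortest_effective_path` and `redundancy_ratio`, the toy node labels, and the choice to count steps rather than tokens are all assumptions made for illustration, and the validation stage that the framework applies to candidate paths is omitted.

```python
from collections import deque


def shortest_effective_path(edges, start, answer):
    """Hypothetical helper: breadth-first search over a directed step graph.

    Returns the shortest list of nodes from `start` to `answer`,
    or None if the answer node is unreachable.
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    parent = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == answer:
            # Walk parent pointers back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adjacency.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


def redundancy_ratio(num_steps, path):
    """Assumed metric: fraction of steps that lie outside the extracted path."""
    return 1.0 - len(path) / num_steps


# Toy trace (invented for illustration): s0 = problem, s1 = setup,
# s2 = computation, s3 = re-verification looping back to s2,
# s4 = answer, s5 = abandoned exploration.
edges = [
    ("s0", "s1"),
    ("s1", "s2"),
    ("s2", "s3"),
    ("s3", "s2"),
    ("s2", "s4"),
    ("s1", "s5"),
]
nodes = {n for edge in edges for n in edge}

path = shortest_effective_path(edges, "s0", "s4")
ratio = redundancy_ratio(len(nodes), path)
print(path, round(ratio, 3))
```

In this toy trace the verification loop (`s3`) and the abandoned branch (`s5`) fall outside the extracted path, so two of six steps count as redundant. A token-weighted variant would replace the step counts with per-node token lengths.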

Paper Structure (31 sections, 22 figures, 4 tables)

Figures (22)

  • Figure 1: Two Chain-of-Thought (CoT) traces from DeepSeek-R1 and Gemini-2.5-Pro on a temporal reasoning task. Although both models reach the correct answer, DeepSeek-R1 (left) shows substantial verbosity, repetition, and multiple reflection/correction loops. In contrast, Gemini-2.5-Pro (right) follows a more direct and efficient path with minimal additional exploration.
  • Figure 2: Automatic evaluation framework of CoTJudger. The pipeline comprises six modules: (1) Step Segmentation and Atomization, (2) Atomic Node Classification, (3) Answer Node Detection and Verification, (4) CoT Graph Construction, (5) Path Extraction and Validation, and (6) Redundancy Metrics Calculation.
  • Figure 3: Positional distribution of redundant reasoning steps in CoT (KDE). The plot shows the normalized probability density of steps outside the Shortest Effective Path across models.
  • Figure 4: Comparative analysis of CoT token-length distributions, examining the effects of model variants (Pro/Base vs. Flash-Thinking) and parameter scaling on reasoning redundancy.
  • Figure 5: Functional role distribution of CoT steps across four domains (General Reasoning, Math, Programming, and PCB). Each chart shows the proportions of universal and domain-specific reasoning roles, highlighting shared structure and domain-adaptive patterns in LRMs.
  • ...and 17 more figures