Decomposing Reasoning Efficiency in Large Language Models

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

TL;DR

This paper introduces a trace-optional framework for analyzing reasoning efficiency in large language models by decomposing token usage into interpretable components. An outcome-level analysis of efficiency (E0) factors it into robustness against token-budget truncation, logic robustness, and verbosity. Using a deterministic per-instance workload proxy W_poi, the framework then normalizes verbosity by workload, splitting it into verbalization overhead VO and a workload-coupling coefficient κ. When reasoning traces are available, a trace-quality decomposition separates grounded, non-redundant signal from degenerate or prompt-copied content, using deterministic measures of grounding, repetition, and prompt copying. Empirically, on CogniLoad with 25 models and 142,800 traces, efficiency rankings can diverge from accuracy rankings, most efficiency gaps arise from logic robustness, and verbalization overhead varies by up to 9×; task length (N) is the dominant driver of efficiency loss. By distinguishing whether inefficiency stems from reasoning quality, verbosity, or budget-truncation behavior, the framework supports targeted interventions and informs benchmark design, guiding model development and evaluation toward compute-efficient, reliable reasoning.

Abstract

Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho = 0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
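
The three outcome-level factors named above can be illustrated with a minimal sketch. It assumes that efficiency is measured as correct answers per generated token and that a truncated response counts as incorrect; both are assumptions made for illustration, not definitions taken from the paper. The `Outcome` record and the `outcome_factors` helper are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Outcome:
    """One model response to one benchmark instance (hypothetical record)."""
    tokens: int      # tokens generated for this instance
    truncated: bool  # True if the token budget cut the response off
    correct: bool    # True if the final answer is correct


def outcome_factors(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Split accuracy-per-token into completion, conditional correctness, verbosity."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    completed = [o for o in outcomes if not o.truncated]

    # Completion under the token budget: share of responses not truncated.
    r_ctx = len(completed) / n
    # Conditional correctness: share of completed responses that are correct.
    r_logic = sum(o.correct for o in completed) / len(completed) if completed else 0.0
    # Verbosity: mean tokens generated per instance.
    mean_tokens = sum(o.tokens for o in outcomes) / n

    # Assumed efficiency: accuracy (r_ctx * r_logic) divided by mean tokens.
    e0 = r_ctx * r_logic / mean_tokens
    return {"r_ctx": r_ctx, "r_logic": r_logic, "mean_tokens": mean_tokens, "E0": e0}
```

Under these assumptions the product `r_ctx * r_logic` equals plain accuracy, so two models with the same accuracy can still differ in efficiency through the verbosity term alone.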

Paper Structure

This paper contains 121 sections, 44 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Where do models waste tokens? Relative to o3, we decompose the efficiency gap $\Delta \log E_0$ into token budget truncation robustness ($\Delta \log r_{\text{ctx}}$), logic robustness ($\Delta \log r_{\text{logic}}$), and workload-normalized verbosity ($-\Delta \log \bar{VO} - \Delta \log \kappa$).
  • Figure 2: Trace-quality-normalized decomposition (12 trace-accessible models). We use DeepSeek-R1-Distill-Llama-70B as the reference for trace-quality analysis because it achieves the highest efficiency among trace-accessible models. The $q_{\text{trace}}$ term (signal density) captures efficiency lost to repetition, prompt-copying, and off-task text; $\bar{VO}_{\mathrm{sig}}$ captures overhead in signal tokens alone; $\kappa_{\mathrm{sig}}$ captures how signal tokens scale with task workload.
  • Figure 3: Verbalization overhead ($\bar{VO}$) vs. coupling ($\kappa$). Cross-model verbosity differences are primarily driven by $\bar{VO}$; $\kappa$ is consistently sublinear ($\kappa < 1$) and varies less across models.
  • Figure 4: Token efficiency $E_0$ across CogniLoad dimensions. Task length $N$ is the dominant bottleneck: efficiency drops 70--90% from $N=20$ to $N=250$. Difficulty shows diminishing effects after $d=3$; needle fraction has modest U-shaped effects.
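
The additive gap in the Figure 1 caption can be sketched in a few lines. The sketch assumes that $E_0$ is proportional to $r_{\text{ctx}} \cdot r_{\text{logic}} / (\bar{VO} \cdot \kappa)$, which is inferred from the signs of the terms in the caption rather than stated in the text here, and that any remaining constant is shared by the model and the reference so that it cancels in the difference. The function name `log_gap_terms` and all numbers in the usage example are hypothetical, not results from the paper.

```python
import math
from typing import Dict


def log_gap_terms(model: Dict[str, float], ref: Dict[str, float]) -> Dict[str, float]:
    """Per-factor contributions to Delta log E0 of `model` relative to `ref`.

    Both dicts need the keys "r_ctx", "r_logic", "VO" and "kappa".
    Negative values mean the model loses efficiency on that factor.
    """
    terms = {
        # Robustness terms enter with a plus sign: higher is better.
        "r_ctx": math.log(model["r_ctx"] / ref["r_ctx"]),
        "r_logic": math.log(model["r_logic"] / ref["r_logic"]),
        # Verbosity terms enter with a minus sign: higher overhead is worse.
        "VO": -math.log(model["VO"] / ref["VO"]),
        "kappa": -math.log(model["kappa"] / ref["kappa"]),
    }
    # Under the assumed product form, the total gap is the sum of the terms.
    terms["total"] = sum(terms.values())
    return terms


if __name__ == "__main__":
    # Made-up values for illustration only.
    reference = {"r_ctx": 1.0, "r_logic": 0.9, "VO": 100.0, "kappa": 0.8}
    candidate = {"r_ctx": 0.9, "r_logic": 0.6, "VO": 300.0, "kappa": 0.8}
    for name, value in log_gap_terms(candidate, reference).items():
        print(f"{name:8s} {value:+.3f}")
```

Working in log space is what makes the contributions additive, so each term can be read as the share of the total gap attributable to one factor.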