Decomposing Reasoning Efficiency in Large Language Models
Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud
TL;DR
This paper introduces a trace-optional framework that dissects reasoning efficiency in large language models by decomposing token usage into interpretable components. It begins with an outcome-level analysis (E0) that factors efficiency into robustness against token-budget truncation, logic robustness, and verbosity. It then normalizes verbosity by workload using a deterministic per-instance workload proxy W_poi, which yields a verbalization overhead VO and a workload-coupling coefficient κ. When reasoning traces are available, the framework adds a trace-quality decomposition that separates grounded, non-redundant signal from degenerate or prompt-copied content, using deterministic measures of grounding, repetition, and prompt copying. Empirically, on CogniLoad (25 models, 142,800 traces), the study shows that efficiency rankings can diverge from accuracy rankings, that most efficiency gaps arise from logic robustness, that verbalization overhead varies by up to 9×, and that task length (N) is the dominant driver of efficiency loss. By distinguishing whether inefficiency stems from reasoning quality, verbosity, or budget-truncation behavior, the framework supports targeted interventions and benchmark design, guiding model development and evaluation toward compute-efficient, reliable reasoning.
Abstract
Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho=0.63$), that efficiency gaps are often driven by conditional correctness, and that verbalization overhead varies by a factor of about 9 and is only weakly related to model scale. Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
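To make the decomposition concrete, the following minimal sketch shows one plausible reading of it; it is not the authors' implementation. It assumes that token efficiency means correct answers per generated token, that truncated runs count as incorrect, and that the coupling coefficient is the ratio of mean token usage to the product of mean overhead and mean workload. The `Run` record, the `decompose` function, and the sample values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Run:
    """One model response on one benchmark instance (hypothetical record)."""
    tokens: int       # tokens generated for this instance
    completed: bool   # finished within the token budget (not truncated)
    correct: bool     # final answer correct; assumed False when truncated
    workload: float   # per-instance workload proxy, assumed > 0


def decompose(runs):
    """Factor correct-answers-per-token into interpretable pieces.

    Assumed identities (an illustrative reading, not the paper's definitions):
      efficiency  = completion * cond_correct / mean_tokens
      mean_tokens = vo * mean_workload * kappa
    """
    done = [r for r in runs if r.completed]
    completion = len(done) / len(runs)
    # Correctness measured only among runs that finished within budget.
    cond_correct = sum(r.correct for r in done) / len(done) if done else 0.0
    mean_tokens = fmean(r.tokens for r in runs)

    # Verbosity factoring: tokens per work unit, averaged over instances.
    vo = fmean(r.tokens / r.workload for r in runs)
    mean_workload = fmean(r.workload for r in runs)
    # kappa > 1 when per-unit overhead tends to grow with workload.
    kappa = mean_tokens / (vo * mean_workload)

    return {
        "completion": completion,
        "cond_correct": cond_correct,
        "mean_tokens": mean_tokens,
        "vo": vo,
        "mean_workload": mean_workload,
        "kappa": kappa,
        "efficiency": completion * cond_correct / mean_tokens,
    }


# Illustrative data only; these are not results from the paper.
runs = [
    Run(tokens=100, completed=True, correct=True, workload=10),
    Run(tokens=200, completed=True, correct=False, workload=10),
    Run(tokens=400, completed=False, correct=False, workload=20),
    Run(tokens=100, completed=True, correct=True, workload=5),
]
result = decompose(runs)
print(result)
```

Under these assumptions the factors multiply back to the raw ratio of correct answers to total tokens, so a low efficiency score can be attributed to truncation, to errors among completed runs, or to verbosity, and verbosity in turn to per-unit overhead or to its coupling with workload.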
