EffiReason-Bench: A Unified Benchmark for Evaluating and Advancing Efficient Reasoning in Large Language Models
Junquan Huang, Haotian Wu, Yubo Gao, Yibo Yan, Junyan Zhang, Yonghua Hei, Song Dai, Jie Zhang, Puay Siew Tan, Xuming Hu
TL;DR
EffiReason-Bench introduces a unified benchmark and a principled trade-off metric, the $E^3$-Score, for fairly evaluating efficient reasoning in large language models across three method paradigms, multiple backbones, and several reasoning domains. The authors construct high-quality CoT annotations for CommonsenseQA and LogiQA and evaluate seven methods on six backbones and four datasets. The results show that no single approach universally dominates: performance depends on backbone scale, task complexity, and architecture. Train-free blueprint methods can yield extreme token compression but often at the cost of accuracy, while train-based pruning maintains or improves accuracy with modest efficiency gains. Dynamic latent-space methods such as Soft Thinking prove robust, and the post-hoc method TokenSkip offers favorable trade-offs in several domains but is architecture-sensitive. The work provides a reproducible foundation for advancing efficient reasoning and highlights the importance of cross-domain, cross-backbone evaluation for informing practical deployment of reasoning strategies.
Abstract
Large language models (LLMs) with Chain-of-Thought (CoT) prompting achieve strong reasoning performance but often produce unnecessarily long explanations, increasing cost and sometimes reducing accuracy. Fair comparison of efficiency-oriented approaches is hindered by fragmented evaluation practices. We introduce EffiReason-Bench, a unified benchmark for rigorous cross-paradigm evaluation of efficient reasoning methods across three categories: Reasoning Blueprints, Dynamic Execution, and Post-hoc Refinement. To enable step-by-step evaluation, we construct verified CoT annotations for CommonsenseQA and LogiQA via a pipeline that enforces standardized reasoning structures, comprehensive option-wise analysis, and human verification. We evaluate 7 methods across 6 open-source LLMs (1B to 70B parameters) on 4 datasets spanning mathematics, commonsense, and logic, and propose the $E^3$-Score, a principled metric inspired by economic trade-off modeling that provides smooth, stable evaluation without discontinuities or heavy reliance on heuristics. Experiments show that no single method universally dominates; optimal strategies depend on backbone scale, task complexity, and architecture.
