Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving
Marwan AbdElhameed, Pavly Halim
TL;DR
This work tackles the tension between deep reasoning and computational efficiency in large language models. It empirically compares two contrasting approaches, Quiet-STaR for enhanced reasoning and REBASE for compute-efficient inference, using the Mistral-7B model on GSM8K. Quiet-STaR delivers the highest accuracy (32.03%) at high compute cost, while REBASE runs far faster with baseline-comparable accuracy (10.94%). Integrating the two unexpectedly degrades performance (9.38% accuracy), which points to fundamental incompatibilities between the methods. The findings underscore the need for new architectures and unified optimization objectives that balance reasoning depth with practical compute constraints, and they suggest directions for future research on compute-efficient reasoning.
Abstract
Recent advances in large language models (LLMs) have predominantly focused on maximizing accuracy and reasoning capabilities, often overlooking computational efficiency. While this focus has yielded impressive accuracy improvements, it has also produced methods that may be impractical for real-world deployment because of their computational overhead and latency. This paper investigates the potential synergy between reasoning enhancement and computational efficiency by analyzing the integration of two contrasting approaches: Quiet-STaR (Self-Taught Reasoner) and REBASE (REward BAlanced SEarch). Through empirical analysis using the Mistral-7B model on the GSM8K dataset, we show that each method excels in its primary objective: Quiet-STaR achieves the highest accuracy (32.03%) at high computational cost (554.66s runtime, 12.73T FLOPs), whereas REBASE is far more efficient (8.47s runtime, 2.35T FLOPs) while maintaining baseline-comparable accuracy (10.94%). Their integration, however, reveals fundamental challenges in reconciling reasoning depth with computational efficiency. The combined approach unexpectedly degrades performance (9.38% accuracy, 143.66s runtime), which highlights the complex interplay between reasoning enhancement and efficiency optimization in LLMs. Our findings point to the need for novel architectures and algorithms specifically designed to bridge the gap between these competing objectives, and they provide concrete directions for future research in compute-efficient reasoning methods.
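
The trade-off in the reported figures can be made concrete with a little arithmetic. The following is a minimal Python sketch, not code from the paper: the `Result` class and the helper names are hypothetical, and "accuracy per TFLOP" is an illustrative metric assumed here, not one the authors report. The abstract gives no FLOPs figure for the combined approach, so that field is left as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """One row of the abstract's reported figures (hypothetical container)."""
    name: str
    accuracy_pct: float       # GSM8K accuracy, in percent
    runtime_s: float          # reported runtime, in seconds
    tflops: Optional[float]   # reported compute, in TFLOPs (None if not stated)


# Figures copied from the abstract.
QUIET_STAR = Result("Quiet-STaR", 32.03, 554.66, 12.73)
REBASE = Result("REBASE", 10.94, 8.47, 2.35)
COMBINED = Result("Quiet-STaR + REBASE", 9.38, 143.66, None)


def runtime_ratio(slow: Result, fast: Result) -> float:
    """How many times longer `slow` runs than `fast`."""
    return slow.runtime_s / fast.runtime_s


def accuracy_per_tflop(r: Result) -> Optional[float]:
    """Illustrative efficiency metric (an assumption, not from the paper)."""
    if r.tflops is None:
        return None
    return r.accuracy_pct / r.tflops


if __name__ == "__main__":
    print(f"Runtime ratio, Quiet-STaR vs REBASE: "
          f"{runtime_ratio(QUIET_STAR, REBASE):.1f}x")
    for r in (QUIET_STAR, REBASE, COMBINED):
        eff = accuracy_per_tflop(r)
        eff_text = "n/a" if eff is None else f"{eff:.2f}"
        print(f"{r.name}: {r.accuracy_pct:.2f}% accuracy, "
              f"{eff_text} accuracy points per TFLOP")
```

Under this illustrative metric, REBASE yields more accuracy points per TFLOP than Quiet-STaR even though its absolute accuracy is much lower, which is one way to read the tension the abstract describes.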
