Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?
Celine Lee, Alexander M. Rush, Keyon Vafa
TL;DR
This work introduces a DFA-based framework for studying how task structure governs the optimal number of reasoning tokens (the critical length $L^*$) that LLMs use at inference time. It shows that an accuracy-maximizing $L^*$ exists consistently across tasks and models, and that $L^*$ correlates strongly with the DFA run length $N$ but only weakly with the DFA state-space size $k$, suggesting that reasoning length primarily supports latent state tracking rather than the representation of larger state spaces. The authors show that a simple predictor based on $(k, N)$ can forecast $L^*$ with $R^2 \approx 0.65$, and that filtering generations to those near the predicted $L^*$ yields measurable accuracy gains, especially for larger or CoT-RL models. These findings offer actionable guidance for inference-time compute: tailoring reasoning length to task structure can improve accuracy and efficiency, motivating future work on more refined predictors and extensions to more complex tasks.
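The filtering step described above can be illustrated with a minimal sketch. It assumes each sampled generation is recorded as a (token count, answer) pair and that a predicted $L^*$ is already available. The function name `filter_by_length`, the symmetric `tolerance` window, and the sample values are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical record type: (number of reasoning tokens, final answer).
Generation = Tuple[int, str]


def filter_by_length(
    generations: List[Generation],
    predicted_length: float,
    tolerance: float,
) -> List[Generation]:
    """Keep generations whose token count lies within `tolerance` of the predicted L*.

    The symmetric window is an assumption made for this sketch.
    """
    return [
        (n_tokens, answer)
        for n_tokens, answer in generations
        if abs(n_tokens - predicted_length) <= tolerance
    ]


# Illustrative values only; these are not results from the paper.
samples = [(40, "q2"), (120, "q0"), (135, "q0"), (400, "q1")]
kept = filter_by_length(samples, predicted_length=128, tolerance=20)
print(kept)
```

Only the generations whose lengths fall inside the window survive. How the surviving answers are then aggregated or scored is left open here.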
Abstract
Large language models (LLMs) often benefit from verbalized reasoning at inference time, but it remains unclear which aspects of task difficulty these extra reasoning tokens address. To investigate this question, we formalize a framework using deterministic finite automata (DFAs). DFAs offer a formalism for characterizing task complexity through measurable properties such as run length (the number of reasoning steps required) and state-space size (decision complexity). We first show that, across different tasks and across models of different sizes and training paradigms, there exists an optimal number of reasoning tokens at which the probability of producing a correct solution is maximized. We then investigate which properties of complexity govern this critical length: we find that task instances with longer underlying DFA runs (i.e., those that demand more latent state tracking) correlate with longer reasoning lengths but, surprisingly, that DFA size (i.e., state-space complexity) does not. We then demonstrate an implication of these findings: predicting the optimal number of reasoning tokens for new problems and filtering out answers of non-optimal length yields consistent accuracy improvements.
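To make the two complexity measures concrete, the following minimal sketch simulates a toy DFA and reads off its state-space size $k$ and the run length $N$ for one input. The automaton (counting occurrences of `a` modulo 3) and the helper `run_dfa` are illustrative assumptions, not tasks or code from the paper.

```python
from typing import Dict, Sequence, Tuple

State = int
Symbol = str
Transitions = Dict[Tuple[State, Symbol], State]


def run_dfa(transitions: Transitions, start: State, inputs: Sequence[Symbol]) -> State:
    """Follow one transition per input symbol and return the final state."""
    state = start
    for symbol in inputs:
        state = transitions[(state, symbol)]
    return state


# Toy DFA (an assumption for illustration): 'a' advances the state modulo 3,
# 'b' leaves the state unchanged.
transitions: Transitions = {(s, "a"): (s + 1) % 3 for s in range(3)}
transitions.update({(s, "b"): s for s in range(3)})

inputs = "abaab"
k = len({state for state, _ in transitions})  # state-space size
N = len(inputs)                               # run length (number of transitions)
final_state = run_dfa(transitions, start=0, inputs=inputs)
print(k, N, final_state)
```

In this framing, lengthening `inputs` increases $N$ while leaving $k$ fixed, and adding states increases $k$ without changing $N$, so the two properties can be varied independently.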
