Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs
Benjamin Estermann, Roger Wattenhofer
TL;DR
This work investigates how the reasoning effort of large language models scales with problem complexity, using the infinitely scalable Tents puzzle as a controlled testbed. The study measures total reasoning tokens and puzzle success across grid sizes, fits linear models to token counts as a function of problem size, and uses these to analyze extrapolative reasoning behavior and compare several state-of-the-art models. Key findings show that reasoning effort generally increases with problem size for solved instances, but may peak or decline at higher complexities, indicating limits in logical coherence and scalability. The results motivate broader benchmarking with PUZZLES and targeted strategies to improve reasoning efficiency and extrapolation in complex problem solving.
Abstract
Large Language Models (LLMs) have demonstrated remarkable text generation capabilities, and recent advances in training paradigms have led to breakthroughs in their reasoning performance. In this work, we investigate how the reasoning effort of such models scales with problem complexity. We use the infinitely scalable Tents puzzle, which has a known linear-time solution, to analyze this scaling behavior. Our results show that reasoning effort scales with problem size, but only up to a critical problem complexity. Beyond this threshold, the reasoning effort does not continue to increase, and may even decrease. This observation highlights a critical limitation in the logical coherence of current LLMs as problem complexity increases, and underscores the need for strategies to improve reasoning scalability. Furthermore, our results reveal significant performance differences between current state-of-the-art reasoning models when faced with increasingly complex logical puzzles.
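As a rough illustration of the scaling analysis described above, the following sketch fits a linear model to reasoning-token counts as a function of problem size. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `fit_token_scaling`, the choice of the number of grid cells as the problem-size measure, and the placeholder numbers are all hypothetical and are not taken from the paper's data.

```python
import numpy as np


def fit_token_scaling(problem_sizes, token_counts):
    """Least-squares fit of tokens ~ slope * size + intercept.

    problem_sizes: problem size per solved instance (assumed here to be
                   the number of grid cells, rows * cols).
    token_counts:  reasoning tokens used on each of those instances.
    Returns (slope, intercept) as floats.
    """
    sizes = np.asarray(problem_sizes, dtype=float)
    tokens = np.asarray(token_counts, dtype=float)
    # np.polyfit with degree 1 returns coefficients highest power first.
    slope, intercept = np.polyfit(sizes, tokens, 1)
    return float(slope), float(intercept)


if __name__ == "__main__":
    # Synthetic placeholder values, NOT measurements from the paper.
    sizes = [9, 16, 25, 36]
    tokens = [950, 1650, 2550, 3650]
    slope, intercept = fit_token_scaling(sizes, tokens)
    print(f"tokens ~ {slope:.1f} * size + {intercept:.1f}")
```

A positive slope on small grids followed by measured token counts that fall below the fitted line on larger grids would correspond to the plateau or decline in reasoning effort that the abstract describes.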
