Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs

Benjamin Estermann, Roger Wattenhofer

TL;DR

This work investigates how the reasoning effort of large language models scales with problem complexity using the infinitely scalable Tents puzzle as a controlled testbed. By measuring the total reasoning tokens and puzzle success across grid sizes, and by fitting linear models to token counts as a function of problem size, the study analyzes extrapolative reasoning behavior and compares several state-of-the-art models. Key findings show that reasoning effort generally increases with problem size for solvable instances, but may peak or decline at higher complexities, indicating limits in logical coherence and scalability. The results motivate broader benchmarking with PUZZLES and targeted strategies to improve reasoning efficiency and extrapolation in complex problem solving.

Abstract

Large Language Models (LLMs) have demonstrated remarkable text generation capabilities, and recent advances in training paradigms have led to breakthroughs in their reasoning performance. In this work, we investigate how the reasoning effort of such models scales with problem complexity. We use the infinitely scalable Tents puzzle, which has a known linear-time solution, to analyze this scaling behavior. Our results show that reasoning effort scales with problem size, but only up to a critical problem complexity. Beyond this threshold, the reasoning effort does not continue to increase, and may even decrease. This observation highlights a critical limitation in the logical coherence of current LLMs as problem complexity increases, and underscores the need for strategies to improve reasoning scalability. Furthermore, our results reveal significant performance differences between current state-of-the-art reasoning models when faced with increasingly complex logical puzzles.

Paper Structure

This paper contains 14 sections, 9 figures.

Figures (9)

  • Figure 1: An example instance of a partially solved 6 by 6 tents puzzle. Tents need to be placed next to trees, away from other tents and fulfilling the row and column constraints.
  • Figure 2: (a) Reasoning effort in number of reasoning tokens versus problem size for DeepSeek R1, o3-mini, and Qwen/QwQ-32B-Preview. Successful attempts only. Linear fits are added for each model. Gemini 2.0 Flash Thinking is excluded due to unknown number of thinking tokens. (b) Solved percentage versus problem size for all models. No model solved problems larger than size 100. o3-mini achieves the highest success rate, followed by DeepSeek R1 and Gemini 2.0 Flash Thinking. Qwen/QwQ-32B-Preview struggles with problem instances larger than size 20.
  • Figure 3: (a) Reasoning effort in number of reasoning tokens versus problem size for o3-mini. A peak in reasoning effort is observed around problem size 100, followed by a decline for larger problem sizes. (b) Reasoning effort in number of reasoning tokens versus problem size for o3-mini, categorized by low, medium, and high reasoning effort strategies. Steeper slopes are observed for higher reasoning effort strategies. High reasoning effort enables solving larger instances but also increases token usage for smaller, already solvable problems.
  • Figure 4: (a) Reasoning effort in number of reasoning tokens versus problem size for o3-mini with reasoning effort low. Successful tries only. Linear fits are added for each model. (b) Solved percentage versus problem size for o3-mini with reasoning effort low.
  • Figure 5: (a) Reasoning effort in number of reasoning tokens versus problem size for o3-mini with reasoning effort medium. Successful tries only. Linear fits are added for each model. (b) Solved percentage versus problem size for o3-mini with reasoning effort medium.
  • ...and 4 more figures
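
The constraints named in the Figure 1 caption (tents next to trees, away from other tents, and matching the row and column counts) can be sketched as a small checker. This is a simplified sketch, not the paper's evaluation code. The grid encoding (`T` for a tree, `^` for a tent, `.` for an empty cell) and the function name `check_tents` are assumptions, "away from other tents" is read as "not touching, even diagonally", and the one-to-one pairing of trees and tents used in the standard puzzle is deliberately not checked.

```python
def check_tents(grid, row_counts, col_counts):
    """Return True if the grid satisfies the three constraints from the caption.

    grid: list of equal-length strings; 'T' = tree, '^' = tent, '.' = empty.
    row_counts / col_counts: required number of tents per row / column.
    Note: the one-to-one tree-tent pairing is NOT verified (simplification).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    if len(row_counts) != n_rows or len(col_counts) != n_cols:
        return False

    def in_bounds(r, c):
        return 0 <= r < n_rows and 0 <= c < n_cols

    tents = [(r, c) for r in range(n_rows) for c in range(n_cols)
             if grid[r][c] == "^"]

    for r, c in tents:
        # A tent must be orthogonally adjacent to at least one tree.
        has_tree = any(
            in_bounds(r + dr, c + dc) and grid[r + dr][c + dc] == "T"
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
        )
        if not has_tree:
            return False
        # No other tent may touch this one, not even diagonally.
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) == (0, 0):
                    continue
                if in_bounds(r + dr, c + dc) and grid[r + dr][c + dc] == "^":
                    return False

    # Tent counts per row and per column must match the clues.
    rows_ok = all(row.count("^") == k for row, k in zip(grid, row_counts))
    cols_ok = all(
        sum(row[c] == "^" for row in grid) == k
        for c, k in enumerate(col_counts)
    )
    return rows_ok and cols_ok


example = ["^T.",
           "...",
           ".T^"]
print(check_tents(example, [1, 0, 1], [1, 0, 1]))
```

The check visits each cell a constant number of times, so verifying a candidate solution takes time linear in the number of grid cells.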