Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization

Mohammad Mahdi Samiei Paqaleh, Arash Marioriyad, Arman Tahmasebi-Zadeh, Mohamadreza Fereydooni, Mahdi Ghaznavai, Mahdieh Soleymani Baghshah

TL;DR

This work defines Complexity OoD as the capacity to generalize to problem instances whose minimal solution complexity is higher than seen in training, formalizing representational complexity with $K(x)$ and computational complexity with $K(y\mid x)$. It argues that genuine System-2 reasoning emerges under complexity pressure and that even System-2 can be viewed as a form of learned generalization over solution structures, thereby unifying learning and reasoning. The paper surveys existing work on variable-length representations and computations, LLM-based reasoning approaches, and complexity-aware evaluation, and it outlines a roadmap for evaluating and building agents with unbounded representational capacity, adaptive depth, and external memory. It emphasizes the need for new benchmarks, supervision paradigms, and inductive biases to enable robust, scalable reasoning beyond data scaling, with practical implications for AI that truly

Abstract

Recent progress has pushed AI frontiers from pattern-recognition tasks toward problems that require step-by-step, System-2-style reasoning, especially with large language models. Yet, unlike learning, where generalization and out-of-distribution (OoD) evaluation concepts are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework and problem setting to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, either representational (richer solution structure) or computational (more reasoning steps/program length), exceeds that of all training examples. We formalize complexity via solution-description Kolmogorov complexity and operational proxies (e.g., object/relation counts; reasoning-step counts), clarifying how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System-1-like processing at low complexity become System-2-like under complexity pressure, while System-2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark and evaluation-metric design, rethinking supervision to target solution traces, seeking and designing inductive biases for Complexity OoD generalization, and addressing learning-to-reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step-wise calibration. Because Complexity OoD cannot be solved by scaling data alone, progress toward robust reasoning will require architectures and training regimes that explicitly model and allocate computation with respect to complexity.
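
The condition stated in the abstract, that test complexity exceeds that of all training examples, can be written compactly. The following is only a sketch: the symbols $\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{test}}$, and $C$ are notation assumed here for illustration and may not match the paper's own equations. $K(x)$ and $K(y \mid x)$ are the representational and computational complexity measures named in the TL;DR.

$$
\min_{(x, y) \in \mathcal{D}_{\text{test}}} C(x, y) \;>\; \max_{(x, y) \in \mathcal{D}_{\text{train}}} C(x, y),
\qquad C(x, y) \in \{\, K(x),\; K(y \mid x) \,\}
$$

Because Kolmogorov complexity is uncomputable, an evaluation would in practice replace $C$ with one of the operational proxies the abstract mentions, such as object/relation counts or reasoning-step counts.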

Paper Structure

This paper contains 47 sections, 4 equations, 4 figures.

Figures (4)

  • Figure 1: Complexity out-of-distribution generalization evaluates whether models trained on problems whose solutions require few, shallow steps generalize to problems whose solutions demand substantially more steps and deeper composition. Two instantiations are shown. Top (Roman numerals): training solutions involve short additive/subtractive decompositions; test solutions require many more operations to expand more complex numerals. Bottom (Visual Question Answering): training features few-hop questions in a simple scene; testing uses relational, multi-hop questions in a busier scene with more entities. The shift is in solution complexity (and, for VQA, scene clutter), not in domain or content.
  • Figure 2: Two examples from the GSM8K dataset in which the number of mathematical operations required to solve the problem can be considered as a proxy for the complexity of the sample problem.
  • Figure 3: (a) The frequency distribution of problem complexity in the GSM8K dataset, measured by the number of arithmetic steps in reference solutions. Most problems are simple, leading to a strong imbalance in the dataset. (b) Model accuracy on GSM8K across different complexity levels, illustrating that as problem complexity increases, model performance drops (especially for non-reasoning models), highlighting the key challenge of complexity out-of-distribution (OoD) generalization. This analysis reveals limitations obscured by average-case metrics and motivates the need for complexity-aware evaluation in benchmarking reasoning ability.
  • Figure 4: (a) Accuracy of Language Models on AIME by Human Solution Token Length. Complexity is estimated by the number of tokens in the human-written solution provided for each problem, used here as a proxy for the length and intricacy of the multi-step reasoning required. Accuracy declines notably with increasing solution length for all models; however, models designed for advanced reasoning (DeepSeek-R1, GPT-o3-mini) maintain higher accuracy and degrade more gently than general-purpose models. (b) Accuracy of Language Models on Omni-MATH by Human Solution Token Length. Here too, as solution complexity (token count) rises, model accuracy drops, especially for models not explicitly optimized for complex reasoning. The pattern reinforces the importance of evaluating models on complexity out-of-distribution (OoD) instances.
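
The complexity-binned analysis described for Figure 3 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' evaluation code. It assumes each reference solution marks every calculation with a `<<...>>` annotation, as the public GSM8K release does, and uses the count of those annotations as the step-count proxy. The function names and the `max_train_steps` threshold are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: each arithmetic step in a reference solution is marked
# with a calculator annotation such as <<48/2=24>>.
STEP_PATTERN = re.compile(r"<<[^<>]*>>")


def count_steps(reference_solution: str) -> int:
    """Proxy for solution complexity: number of annotated arithmetic steps."""
    return len(STEP_PATTERN.findall(reference_solution))


def accuracy_by_complexity(records):
    """Group (reference_solution, is_correct) pairs by step count.

    Returns a dict mapping step count -> accuracy, so degradation at
    higher complexity is not hidden by an average over all problems.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for solution, is_correct in records:
        steps = count_steps(solution)
        totals[steps] += 1
        hits[steps] += int(is_correct)
    return {steps: hits[steps] / totals[steps] for steps in sorted(totals)}


def complexity_ood_split(solutions, max_train_steps):
    """Hypothetical split: low-complexity items for training,
    strictly higher-complexity items for testing."""
    train = [s for s in solutions if count_steps(s) <= max_train_steps]
    test = [s for s in solutions if count_steps(s) > max_train_steps]
    return train, test


if __name__ == "__main__":
    one_step = "He pays 3*4 = <<3*4=12>>12 dollars. #### 12"
    two_step = (
        "She sold 48/2 = <<48/2=24>>24 clips in May. "
        "In total she sold 48+24 = <<48+24=72>>72 clips. #### 72"
    )
    # Made-up correctness flags, purely to exercise the functions.
    records = [(one_step, True), (two_step, True), (two_step, False)]
    print(accuracy_by_complexity(records))
    print(complexity_ood_split([one_step, two_step], max_train_steps=1))
```

The same binning applies to the Figure 4 setup if `count_steps` is swapped for a token count of the human-written solution; the choice of tokenizer there would be a further assumption.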