When Reasoning Meets Its Laws

Junyu Zhang, Yifan Sun, Tianang Leng, Jingyan Shen, Liu Ziyin, Paul Pu Liang, Huan Zhang

TL;DR

The paper introduces the Laws of Reasoning (LoRe) framework, which formalizes how a Large Reasoning Model's (LRM's) reasoning compute and accuracy should change as question complexity grows. It states a compute law (compute scales linearly with complexity) and an accuracy law (accuracy decays exponentially with complexity). Because complexity is hard to measure directly, it tests the laws through two tractable properties, monotonicity and compositionality, and builds LoRe-Bench to evaluate them. Empirically, current LRMs show reasonable monotonicity but poor compositionality, which motivates a compositionality-focused finetuning method (SFT-Compo) that improves performance across multiple benchmarks and model sizes, with synergistic gains across properties and laws. The work offers both a theoretical lens and a practical toolkit for steering LRMs toward more human-like reasoning trade-offs, and releases open-source resources for reproduction and further research.

Abstract

Despite the superior performance of Large Reasoning Models (LRMs), their reasoning behaviors are often counterintuitive, leading to suboptimal reasoning capabilities. To theoretically formalize the desired reasoning behaviors, this paper presents the Laws of Reasoning (LoRe), a unified framework that characterizes intrinsic reasoning patterns in LRMs. We first propose the compute law, with the hypothesis that reasoning compute should scale linearly with question complexity. Beyond compute, we extend LoRe with a supplementary accuracy law. Since question complexity is difficult to quantify in practice, we examine these hypotheses through two properties of the laws, monotonicity and compositionality. We therefore introduce LoRe-Bench, a benchmark that systematically measures these two tractable properties for large reasoning models. Evaluation shows that most reasoning models exhibit reasonable monotonicity but lack compositionality. In response, we develop an effective finetuning approach that enforces compute-law compositionality. Extensive empirical studies demonstrate that better compliance with compute laws yields consistently improved reasoning performance on multiple benchmarks, and uncovers synergistic effects across properties and laws. Project page: https://lore-project.github.io/

Paper Structure

This paper contains 40 sections, 6 theorems, 47 equations, 11 figures, 3 tables.

Key Result

Proposition 1

Under certain conditions, if a reasoning model $M_\theta$ satisfies compute-complexity monotonicity and compositionality, then its reasoning compute $C_\theta(x) \propto \kappa(x)$ for $x\in\mathcal{X}$.
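
The conditions are not spelled out in this summary, so the following is only a sketch of why the two properties point toward linearity, not the paper's proof. It assumes three things that are not stated here: a composition operator (written $\oplus$, a hypothetical notation) on independent questions, complexity that is additive under that composition, and compute that depends on a question only through its complexity, $C_\theta(x) = f(\kappa(x))$. It also assumes the attainable complexity values cover $[0, \infty)$.

```latex
% Assumptions (not from the summary): \oplus composes two independent questions;
% complexity is additive under \oplus; C_\theta(x) = f(\kappa(x)).
\begin{align*}
\kappa(x_1 \oplus x_2) &= \kappa(x_1) + \kappa(x_2)
  && \text{(assumed additivity of complexity)} \\
C_\theta(x_1 \oplus x_2) &= C_\theta(x_1) + C_\theta(x_2)
  && \text{(compute-law compositionality)} \\
\Rightarrow\quad f(a + b) &= f(a) + f(b)
  && \text{for } a = \kappa(x_1),\; b = \kappa(x_2) \\
\Rightarrow\quad f(t) &= \alpha\, t, \quad \alpha = f(1) \ge 0
  && \text{(monotone solutions of Cauchy's equation)} \\
\Rightarrow\quad C_\theta(x) &= \alpha\, \kappa(x) \;\propto\; \kappa(x).
\end{align*}
```

Monotonicity is what rules out the pathological, non-measurable solutions of the additive functional equation; compositionality alone would not be enough to force a linear form.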

Figures (11)

  • Figure 1: Illustrative example with DeepSeek-R1 on (a) a summation question, (b) a squaring question, and (c) their composition ("sum, then square"). The model allocates 300 more reasoning tokens to solve the squaring question than to the composite question, with a 12.5% accuracy drop. The mismatch with human reasoning reveals an abnormal reasoning pattern present in current LRMs.
  • Figure 2: Overview of the LoRe Framework. We present the compute law with the complementary accuracy law. These laws posit that compute scales linearly and accuracy decays exponentially with question complexity. Our framework approximates these laws using two properties: monotonicity and compositionality. Specifically, for the compute law, monotonicity captures that more complex questions require more compute, while compositionality indicates that for two independent questions, the compute for their composition is the sum of solving each individually.
  • Figure 2: Compositionality Results on LoRe-Compo. We calculate $\mathrm{nMAD}$ for reasoning compute ($C_\theta$) and log accuracy ($\log A_\theta$).
  • Figure 3: Question Generation of LoRe-Mono. For each seed question, we generate 30 variants with increasing complexity. Specifically, variant $N$ applies the update rules $N$ times to compute the answer, so the question complexity increases monotonically with $N$.
  • Figure 4: Visualizations of Monotonicity Results on DeepSeek-R1-1.5B. For each domain, we plot reasoning compute and log accuracy as a function of variant index. The curves report the mean accuracy across 10 question series, and the shaded regions denote the standard deviation.
  • ...and 6 more figures
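
The captions above describe two measurements: how a quantity changes with the variant index (monotonicity) and how far a composite question deviates from the sum of its parts (compositionality). The sketch below shows one plausible way to compute both using only the Python standard library. The exact $\mathrm{nMAD}$ definition is not given in this summary, so `normalized_additivity_gap` is an assumed stand-in (mean absolute deviation divided by the mean magnitude of the additive prediction), and the use of Spearman rank correlation for monotonicity is likewise an assumption. Both function names are hypothetical.

```python
from statistics import mean


def spearman_rho(xs, ys):
    """Spearman rank correlation, using average ranks for ties.

    Assumed monotonicity score: values near +1 mean ys rises with xs.
    """
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0.0] * len(values)
        i = 0
        while i < len(order):
            j = i
            # Extend j over the run of tied values starting at position i.
            while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
                j += 1
            avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                result[order[k]] = avg_rank
            i = j + 1
        return result

    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no rank information
    return cov / (var_x * var_y) ** 0.5


def normalized_additivity_gap(part_a, part_b, composite):
    """Assumed stand-in for nMAD (the paper's exact formula may differ).

    part_a[i], part_b[i]: the quantity (e.g. reasoning tokens, or log
    accuracy) measured on two independent questions.
    composite[i]: the same quantity measured on their composition.
    Returns mean |composite - (a + b)| / mean |a + b|; 0 means perfectly additive.
    """
    predicted = [a + b for a, b in zip(part_a, part_b)]
    gaps = [abs(c - p) for c, p in zip(composite, predicted)]
    return mean(gaps) / mean([abs(p) for p in predicted])
```

With purely hypothetical token counts, `spearman_rho([1, 2, 3, 4], [120, 180, 260, 400])` scores a series whose compute rises with every variant, and `normalized_additivity_gap([100], [100], [100])` scores a composite question that received only half of the additive prediction. The same gap function applies to log accuracy, since under an exponential accuracy decay the log accuracy of a composition would be expected to add.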

Theorems & Definitions (11)

  • Definition 1: Complexity
  • Definition 3: Reasoning Accuracy
  • Definition 4: Independence
  • Proposition 1
  • Proposition 2
  • Proposition 1: Formal Version
  • proof
  • Corollary D.1: Asymptotic version with sublinear overhead
  • Proposition 2: Formal Version
  • proof
  • ...and 1 more