The Serial Scaling Hypothesis

Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, Yutong Bai

TL;DR

The paper formalizes the Serial Scaling Hypothesis (SSH), arguing that many real-world tasks require long serial computation that cannot be efficiently parallelized by current architectures. By casting problems in a $\mathsf{TC}$-theoretic framework, it delineates parallel ($\mathsf{TC}$) vs. inherently serial problems and demonstrates that diffusion models with $\mathsf{TC}^0$ backbones have limited serial capacity, failing to solve general inherently serial tasks. It catalogs illustrative serial problems (cellular automata, many-body mechanics, sequential decision making, and math QA), showing they demand step-by-step computation that cannot be shortcut. The paper discusses implications for model design, hardware development, and benchmarks, arguing for architectures and training strategies that accommodate serial depth and for recognizing inherently serial tasks as a distinct benchmark category. Overall, it highlights a fundamental limit of parallel scaling and motivates a broader view of computation that includes significant serial computation in ML systems.

Abstract

While machine learning has advanced through massive parallelization, we identify a critical blind spot: some problems are fundamentally sequential. These "inherently serial" problems, from mathematical reasoning to physical simulations to sequential decision-making, require sequentially dependent computational steps that cannot be efficiently parallelized. We formalize this distinction in complexity theory, and demonstrate that current parallel-centric architectures face fundamental limitations on such tasks. Then, we show for the first time that diffusion models, despite their sequential nature, are incapable of solving inherently serial problems. We argue that recognizing the serial nature of computation holds profound implications for machine learning, model design, and hardware development.

Paper Structure

This paper contains 30 sections, 4 theorems, 11 equations, 8 figures, 2 tables.

Key Result

Theorem 4.1

If a problem can be solved by a diffusion model with a $\mathsf{TC}^0$ backbone with high probability with infinite diffusion steps, then the problem itself is in the parallelizable class $\mathsf{TC}^0$.

Figures (8)

  • Figure 1: (Left) Many easy Sudoku puzzles, where the circled blanks can be filled independently in parallel. (Right) A hard Sudoku with the same total compute, but the circled blanks are interdependent, requiring sequential reasoning.
  • Figure 2: (A) A decision problem has a variable-size input and a fixed-size output (e.g., "yes"/"no"). (B) A serial problem requires deeper or more steps as the problem size grows. Examples of serial problems are: (C) Cellular automaton: takes the initial state as input and outputs a discrete value of the row $N$ at cell $i$ for $i \in \{1, \dots, 2N-1\}$. (D) Many-body mechanics: takes initial positions and momenta of each particle with time $T$ as inputs and outputs the particle locations at time $T$ in a limited-precision space. (E) Math QA: takes a question as input and outputs the answer autoregressively, with each output from a fixed set of possibilities.
  • Figure 3: The complexity classes are nested as $\mathsf{TC}^0 \subseteq \mathsf{TC}^1 \subseteq \dots \subseteq \mathsf{TC} \subseteq \mathsf{P}$. Each containment is widely believed to be strict. Problems in $\mathsf{TC}$ are parallel, while those outside are inherently serial.
  • Figure 4: A single run of Rule 110. Given the top row, the CA evolves row-by-row according to the 8 rules.
  • Figure 5: Predicting the frame at time $T$. The intermediate frames may not be observable due to camera motion or occlusion.
  • ...and 3 more figures
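
The row-by-row evolution described for Rule 110 in Figure 4 can be sketched in a few lines of Python. This is only an illustration, not the paper's code: the function names `step` and `run` are hypothetical, and treating cells beyond the row's edges as 0 is an assumption made here to keep the row width fixed. The loop in `run` shows the serial dependence: in this direct simulation, row $N$ is computed only after row $N-1$.

```python
# The 8 rules of Rule 110: (left, center, right) -> next value of the center cell.
# Reading the outputs from 111 down to 000 gives 01101110, i.e. 110 in binary.
RULE_110 = {
    (1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
    (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0,
}


def step(row):
    """Compute the next row from the current one.

    Assumption (not from the source): cells outside the row are 0.
    """
    padded = [0] + list(row) + [0]
    return [
        RULE_110[(padded[i - 1], padded[i], padded[i + 1])]
        for i in range(1, len(padded) - 1)
    ]


def run(initial, n_steps):
    """Evolve the automaton for n_steps rows, one row at a time."""
    row = list(initial)
    for _ in range(n_steps):
        # Each row depends on the previous one, so these iterations
        # cannot be reordered or run independently of each other.
        row = step(row)
    return row


if __name__ == "__main__":
    row = [0, 0, 0, 1]
    for _ in range(4):
        print("".join("#" if c else "." for c in row))
        row = step(row)
```

Within a single row, every cell can be updated in parallel; the serial cost comes from the number of rows, which matches the distinction between width and depth drawn in Figure 2.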

Theorems & Definitions (9)

  • Definition 2.1: Informal, see \ref{sec:tc}
  • Definition 2.2
  • Theorem 4.1: Informal
  • Definition D.1
  • Theorem F.1
  • proof
  • Theorem G.1
  • Theorem G.2: optimal decision in the DO1 environment is inherently serial
  • proof