On the Fundamental Limits of LLMs at Scale

Muhammad Ahmed Mohsin, Muhammad Umer, Ahsan Bilal, Zeeshan Memon, Muhammad Ibtsaam Qadir, Sagnik Bhattacharya, Hassan Rizwan, Abhiram R. Gorle, Maahe Zehra Kazmi, Ayesha Mohsin, Muhammad Usman Rafique, Zihao He, Pulkit Mehta, Muhammad Ali Jamshed, John M. Cioffi

TL;DR

The paper formalizes five fundamental limits of LLMs at scale (hallucination, context compression, reasoning degradation, retrieval fragility, and multimodal misalignment) and links them to core principles in computability, information theory, and learnability. It presents a unified, proof-informed framework showing why scaling cannot universally erase these pathologies and outlines hierarchical bounds (diagonalization, undecidability, compression limits, and sample complexity) governing open-ended queries. The work also surveys mechanisms driving these failures in data, evaluation, and optimization, and proposes mitigation patterns such as bounded-oracle retrieval, positional curricula, and sparse/hierarchical attention, together with program-aided, neuro-symbolic, and multimodal strategies. The practical impact is a shift from chasing unbounded scaling to designing architecture-aware, verifiable, and uncertainty-aware systems with calibrated evaluation and robust deployment considerations. Overall, the paper provides a principled roadmap for understanding, bounding, and navigating the intrinsic fallibility of LLMs in real-world settings.

Abstract

Large Language Models (LLMs) have benefited enormously from scaling, yet these gains are bounded by five fundamental limitations: (1) hallucination, (2) context compression, (3) reasoning degradation, (4) retrieval fragility, and (5) multimodal misalignment. While existing surveys describe these phenomena empirically, they lack a rigorous theoretical synthesis connecting them to the foundational limits of computation, information, and learning. This work closes that gap by presenting a unified, proof-informed framework that formalizes the innate theoretical ceilings of LLM scaling. First, computability and uncomputability imply an irreducible residue of error: for any computably enumerable model family, diagonalization guarantees inputs on which some model must fail, and undecidable queries (e.g., halting-style tasks) induce infinite failure sets for all computable predictors. Second, information-theoretic and statistical constraints bound attainable accuracy even on decidable tasks: finite description length enforces compression error, and long-tail factual knowledge requires prohibitive sample complexity. Third, geometric and computational effects compress long contexts far below their nominal size due to positional under-training, encoding attenuation, and softmax crowding. We further show how likelihood-based training favors pattern completion over inference, how retrieval under token limits suffers from semantic drift and coupling noise, and how multimodal scaling inherits shallow cross-modal alignment. Across sections, we pair theorems and empirical evidence to outline where scaling helps, where it saturates, and where it cannot progress, providing both theoretical foundations and practical mitigation paths like bounded-oracle retrieval, positional curricula, and sparse or hierarchical attention.

Paper Structure

This paper contains 90 sections, 16 theorems, 108 equations, 8 figures, 1 table.

Key Result

Theorem 1

For any computably enumerable set of LLMs $\{h_0, h_1, h_2, \ldots\}$, where each $h_i: \Sigma^* \to \mathcal{Y}$ maps input strings to outputs, there exists a computable ground-truth function $f: \Sigma^* \to \mathcal{Y}$ such that every model state $h_i^{[j]}$ (at training step $j$) hallucinates o
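The diagonal construction behind results of this kind can be illustrated on a finite toy family. The sketch below is only an assumption-laden illustration, not the paper's proof: the three stand-in "models", the input strings `s0`, `s1`, `s2`, the binary label set, and the helper name `diagonal_ground_truth` are all hypothetical. It builds a ground truth that disagrees with model $h_i$ on input $s_i$, so every model in the family is wrong somewhere.

```python
from typing import Callable, Dict, List, Tuple

# A "model" here is any function from input strings to output labels
# (hypothetical stand-in for h_i: Sigma^* -> Y).
Model = Callable[[str], str]


def diagonal_ground_truth(
    models: List[Model],
    inputs: List[str],
    labels: Tuple[str, str] = ("0", "1"),
) -> Dict[str, str]:
    """Define f on the diagonal so that f(s_i) != h_i(s_i) for every i."""
    truth: Dict[str, str] = {}
    for h, s in zip(models, inputs):
        prediction = h(s)
        # Pick whichever label the i-th model did NOT output on the i-th input.
        truth[s] = labels[1] if prediction == labels[0] else labels[0]
    return truth


# Toy enumeration of three models (assumed for illustration only).
models: List[Model] = [
    lambda s: "0",                # always answers "0"
    lambda s: "1",                # always answers "1"
    lambda s: str(len(s) % 2),    # answers the parity of the input length
]
inputs = ["s0", "s1", "s2"]

truth = diagonal_ground_truth(models, inputs)
for i, (h, s) in enumerate(zip(models, inputs)):
    print(f"h_{i}({s!r}) = {h(s)!r}, f({s!r}) = {truth[s]!r}")
```

Because each `truth[s]` is computed by running the corresponding model and flipping its answer, the ground truth stays computable whenever the models are; the same idea extends to an infinite enumeration by pairing the $i$-th model with the $i$-th input.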

Figures (8)

  • Figure 1: Five interacting fronts that bound LLM reliability. Long context window: practical use is curtailed by training on finite windows, inputs that exceed the window, positional-encoding overlap, and computational constraints. Reasoning: adherence to rules/logic, exploitation of reasoning patterns, and cross-step consistency remain brittle. Hallucination: prompt and sentence-level contradictions, amplified by language complexity, induce factual errors. Retrieval quality: database and evidence selection are filtered by retrieval metrics, yet degrade under query ambiguity and attention distraction during integration. Multimodality: cross-modal inputs introduce architectural colonization effects, epistemic pitfalls, and scaling/deployment challenges. Arrows indicate information flow and couplings among factors analyzed in subsequent sections.
  • Figure 2: Taxonomy of hallucination sources in LLMs. (Fundamental limits.) Diagonalization (no enumerable model set answers all queries), uncomputability (undecidable problems force infinite failures), and statistical constraints (finite models cannot compress infinite information). (Data failures.) Incomplete coverage, noise (2–3% error rates), long-tail distributions, temporal decay (>50% staleness after 6 months), conflicts, and exposure bias. (Evaluation misalignment.) Binary grading equates uncertainty with wrong answers, incentivizing fabrication across benchmarks (MMLU-Pro, Graduate-Level Google-Proof Q&A (GPQA), MATH), causing reinforcement learning from human feedback (RLHF) reward hacking and overconfidence. (Creativity-factuality trade-off.) Low temperature yields accurate but repetitive outputs; high temperature enables diversity but increases errors.
  • Figure 3: Empirical evidence of data-induced hallucinations. (a) Model accuracy exhibits a steep degradation for rare entities, dropping from >95% for highly popular entities (100k+ Wikipedia views/day) to <40% for tail entities (<100 views/day). (b) Information validity decays over time since the training cutoff. While static facts remain valid indefinitely and demographics change slowly, rapidly evolving domains cross the 50% validity threshold within 6 months, causing temporally induced hallucinations as models lack explicit temporal reasoning and treat all training data as contemporaneous.
  • Figure 4: Position-frequency distribution for models trained with 2K vs. 4K sequence lengths after 1T tokens.
  • Figure 5: Overview of the three main factors limiting effective long-context reasoning in transformers. (1) Training distribution skew: long positions are underrepresented, leaving distant tokens undertrained. (2) Positional encoding attenuation: sinusoidal cancellation or RoPE phase misalignment shrinks positional overlap $S_{\text{pos}}(\Delta)$, weakening long-range alignment. (3) Attention computation limits: softmax crowding requires $\sim \ln N$ score margins to overcome distractors, while quadratic memory/computation further restricts practical sequence length.
  • ...and 3 more figures
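
The $\sim \ln N$ score margin mentioned in the Figure 5 caption can be checked with a small calculation. The sketch below assumes a deliberately simplified setting that is not taken from the paper: one target token whose attention logit exceeds those of $N-1$ distractors by a fixed margin, with all distractors sharing the same logit. The function names are hypothetical. Under that assumption the target's softmax weight is $e^{m} / (e^{m} + N - 1)$, so holding at least half of the attention mass requires $m \ge \ln(N-1)$.

```python
import math


def target_attention_weight(margin: float, n_tokens: int) -> float:
    """Softmax weight on a single target whose logit exceeds the logits of
    (n_tokens - 1) equal-score distractors by `margin` (simplifying assumption)."""
    return math.exp(margin) / (math.exp(margin) + (n_tokens - 1))


def margin_for_weight(weight: float, n_tokens: int) -> float:
    """Margin needed for the target to receive `weight` of the attention mass,
    obtained by solving weight = e^m / (e^m + n_tokens - 1) for m."""
    return math.log((n_tokens - 1) * weight / (1.0 - weight))


# The required margin grows logarithmically with the context length.
for n in (1_024, 32_768, 1_048_576):
    m = margin_for_weight(0.5, n)
    print(f"N = {n:>9}: margin for 50% attention = {m:.2f}")
```

With a fixed margin, the target's weight shrinks roughly as $1/N$, which is one way to read the "softmax crowding" effect: longer contexts demand larger logit gaps just to keep attention on the relevant token.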

Theorems & Definitions (24)

  • Theorem 1: Inevitability for enumerable LLMs
  • proof
  • Theorem 2: Infinite hallucinations
  • proof
  • Theorem 3: Undecidable problems force hallucination
  • proof
  • Lemma 1: Kolmogorov complexity bottleneck
  • proof
  • Theorem 4: Sample complexity for arbitrary facts
  • proof
  • ...and 14 more