Computational-Statistical Tradeoffs at the Next-Token Prediction Barrier: Autoregressive and Imitation Learning under Misspecification

Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, Dylan J. Foster

TL;DR

Under misspecification, where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq 1$, the paper confirms that $C$ indeed grows with the sequence length $H$ for next-token prediction, lending theoretical support to the empirical hypothesis that misspecification is a root cause of error amplification.

Abstract

Next-token prediction with the logarithmic loss is a cornerstone of autoregressive sequence modeling, but, in practice, suffers from error amplification, where errors in the model compound and generation quality degrades as sequence length $H$ increases. From a theoretical perspective, this phenomenon should not appear in well-specified settings, and, indeed, a growing body of empirical work hypothesizes that misspecification, where the learner is not sufficiently expressive to represent the target distribution, may be the root cause. Under misspecification -- where the goal is to learn as well as the best-in-class model up to a multiplicative approximation factor $C\geq 1$ -- we confirm that $C$ indeed grows with $H$ for next-token prediction, lending theoretical support to this empirical hypothesis. We then ask whether this mode of error amplification is avoidable algorithmically, computationally, or information-theoretically, and uncover inherent computational-statistical tradeoffs. We show: (1) Information-theoretically, one can avoid error amplification and achieve $C=O(1)$. (2) Next-token prediction can be made robust so as to achieve $C=\tilde O(H)$, representing moderate error amplification, but this is an inherent barrier: any next-token prediction-style objective must suffer $C=\Omega(H)$. (3) For the natural testbed of autoregressive linear models, no computationally efficient algorithm can achieve sub-polynomial approximation factor $C=e^{(\log H)^{1-\Omega(1)}}$; however, at least for binary token spaces, one can smoothly trade compute for statistical power and improve on $C=\Omega(H)$ in sub-exponential time. Our results have consequences in the more general setting of imitation learning, where the widely-used behavior cloning algorithm generalizes next-token prediction.
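
One way to make the role of $C$ concrete, offered as an illustrative assumption rather than the paper's exact definition: take the squared Hellinger distance $D^{2}_{\mathsf{H}}$ (the divergence that appears in Proposition 1 on this page) as the error measure, and let $\varepsilon_{\mathrm{stat}}(n)$ be a placeholder symbol, not taken from the text above, for a statistical error term that vanishes as the number of samples $n$ grows. A guarantee with approximation factor $C$ might then read:

$$
D^{2}_{\mathsf{H}}\left(\mathbb{P}^{\widehat{\pi}},\mathbb{P}^{\pi^{\star}}\right) \;\leq\; C\cdot\min_{\pi\in\Pi} D^{2}_{\mathsf{H}}\left(\mathbb{P}^{\pi},\mathbb{P}^{\pi^{\star}}\right) + \varepsilon_{\mathrm{stat}}(n).
$$

Under this reading, the well-specified case $\pi^{\star}\in\Pi$ makes the minimum zero, so $C$ plays no role; under misspecification, the results listed in the abstract describe how large $C$ must be as a function of $H$.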

Paper Structure

This paper contains 82 sections, 67 theorems, 299 equations, 5 algorithms.

Key Result

proposition 1

Whenever $\pi^{\star}\in\Pi$, the estimator $\widehat{\pi}$ in eq:bc satisfies $D^{2}_{\mathsf{H}}\left(\mathbb{P}^{\widehat{\pi}},\mathbb{P}^{\pi^{\star}}\right) \leq 2\log(\lvert\Pi\rvert\delta^{-1})/n$ with probability at least $1-\delta$. (For simplicity, we work with finite classes $\Pi$.)
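
To get a feel for the scale of this bound, the following minimal sketch evaluates its right-hand side, $2\log(\lvert\Pi\rvert\delta^{-1})/n$, and inverts it to find a sample size that drives the bound below a target. The function names `hellinger_bound` and `samples_needed` and the example numbers are illustrative assumptions, not taken from the paper.

```python
import math


def hellinger_bound(class_size: int, n: int, delta: float) -> float:
    """Right-hand side of the Proposition 1 bound: 2 * log(|Pi| / delta) / n.

    class_size: size of the finite class Pi (hypothetical parameter name)
    n:          number of samples
    delta:      failure probability, strictly between 0 and 1
    """
    if class_size < 1 or n < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need class_size >= 1, n >= 1, and 0 < delta < 1")
    return 2.0 * math.log(class_size / delta) / n


def samples_needed(class_size: int, delta: float, eps: float) -> int:
    """Smallest n for which the bound above is at most eps."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    if class_size < 1 or not (0.0 < delta < 1.0):
        raise ValueError("need class_size >= 1 and 0 < delta < 1")
    return max(1, math.ceil(2.0 * math.log(class_size / delta) / eps))


if __name__ == "__main__":
    # Illustrative numbers only: |Pi| = 1000, n = 10_000, delta = 0.05.
    print(hellinger_bound(1000, 10_000, 0.05))
    print(samples_needed(1000, 0.05, 0.01))
```

The bound depends on the class only through $\log\lvert\Pi\rvert$ and decays as $1/n$, with no dependence on the sequence length $H$ in this well-specified case, which is the contrast the abstract draws with the misspecified setting.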

Theorems & Definitions (137)

  • proposition 1: foster2024behavior
  • remark 1: Connection to imitation learning
  • remark 2: Terminology for next-token prediction
  • proposition 2: informal; see cor:linear-wellspec-logloss
  • theorem 1
  • definition 1: Density ratio bound
  • theorem 2
  • proposition 3: Tightness of thm:log-loss-bounded
  • corollary 1
  • corollary 2
  • ...and 127 more