Limitations of Autoregressive Models and Their Alternatives

Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, Jason Eisner

TL;DR

Autoregressive models efficiently assign $p(\mathbf{x})=\prod_t p(x_t \mid \mathbf{x}_{<t})$ but cannot represent distributions whose next-symbol probabilities are hard to compute. The paper introduces a formal framework of efficiently computable and compact-parameter weighted languages (EC/ECCP) and their autoregressive counterparts (ELN/ELNCP), linking them to complexity classes such as $\mathrm{P}$, $\mathrm{P/poly}$, and $\mathrm{NP/poly}$ via reductions from SAT. It shows that autoregressive models (ELN/ELNCP) have strictly less capacity than general EC/ECCP models: there exist EC distributions whose local conditionals cannot be efficiently learned by ELNCP, and ELNCP cannot capture all supports or orderings of EC/ECCP distributions. The authors propose alternative families (energy-based models (EBMs), latent-variable autoregressive models, and lookup/semiparametric approaches) that can escape these limitations, with residual EBMs demonstrating practical perplexity gains. Overall, the work clarifies fundamental tradeoffs between efficient scoring, parameter efficiency, and expressive power, guiding the design of future NLP models that balance computation with representational reach. It also suggests several directions for experimental validation and theoretical refinement with $\mathrm{P}$- and $\mathrm{NP}$-theoretic considerations.

Abstract

Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.
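The tradeoff described in the abstract can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than anything from the paper: the two-symbol alphabet, the `EOS` marker, the hand-written conditionals in `next_symbol_probs`, and the `energy` function are all assumptions chosen for illustration. The autoregressive scorer does a small amount of work per symbol, while the energy-based scorer only yields an unnormalized weight, so turning it into a probability here requires summing over exponentially many strings.

```python
import math
from itertools import product

ALPHABET = ("a", "b")
EOS = "$"  # hypothetical end-of-sequence marker


def next_symbol_probs(prefix):
    """Toy local conditional p(. | prefix); a stand-in for a trained model."""
    if not prefix:
        return {EOS: 0.5, "a": 0.25, "b": 0.25}
    last = prefix[-1]
    # Stop with probability 0.5; otherwise slightly prefer repeating the last symbol.
    probs = {s: (0.3 if s == last else 0.2) for s in ALPHABET}
    probs[EOS] = 0.5
    return probs


def autoregressive_log_prob(x):
    """log p(x) = sum_t log p(x_t | x_<t), including the final EOS decision."""
    logp = 0.0
    for t, symbol in enumerate(x + EOS):
        logp += math.log(next_symbol_probs(x[:t])[symbol])
    return logp


def energy(x):
    """Hypothetical energy: penalize length and each adjacent pair of differing symbols."""
    switches = sum(1 for u, v in zip(x, x[1:]) if u != v)
    return 1.5 * len(x) + 0.5 * switches


def ebm_unnormalized_weight(x):
    """Cheap to compute for a given string, but not yet a probability."""
    return math.exp(-energy(x))


def brute_force_partition(max_len):
    """Sum weights over all strings up to max_len; cost is exponential in max_len."""
    return sum(
        ebm_unnormalized_weight("".join(chars))
        for n in range(max_len + 1)
        for chars in product(ALPHABET, repeat=n)
    )
```

In this sketch the autoregressive model is normalized by construction, since each local conditional sums to one, whereas the energy-based model needs `brute_force_partition` (here truncated at a maximum length) before its weights can be read as probabilities.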

Paper Structure

This paper contains 41 sections, 23 theorems, 11 equations, 2 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

For any $L \in \mathrm{P}$, there exists an EC weighted language with support $L$. For any $L \in \mathrm{P/poly}$, there exists an ECCP language with support $L$. But for any $\mathrm{NP}$-complete $L$, there exists no ECCP language with support $L$ (assuming $\mathrm{NP}\nsubseteq\mathrm{P/poly}$).
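One way to see why the first claim is plausible is the following sketch, which is an illustrative construction under stated assumptions and not necessarily the one used in the paper's proof. It assumes that an EC weighted language only needs an unnormalized weight that is computable in time polynomial in the string length and has a finite total. The language (palindromes over a two-symbol alphabet) and the names `in_language`, `unnormalized_weight`, and `partial_mass` are hypothetical choices made for this example.

```python
from fractions import Fraction
from itertools import product

ALPHABET = ("a", "b")


def in_language(x):
    """Polynomial-time decider for a sample language in P: palindromes over {a, b}."""
    return x == x[::-1]


def unnormalized_weight(x):
    """Positive exactly on the language; computable in time polynomial in len(x)."""
    if not in_language(x):
        return Fraction(0)
    # There are 2^n strings of length n, each weighted at most 3^-(n+1),
    # so the total over all strings is at most sum_n (1/3)(2/3)^n = 1.
    return Fraction(1, 3 ** (len(x) + 1))


def partial_mass(max_len):
    """Total weight of all strings of length <= max_len (brute force, for checking only)."""
    return sum(
        (
            unnormalized_weight("".join(chars))
            for n in range(max_len + 1)
            for chars in product(ALPHABET, repeat=n)
        ),
        Fraction(0),
    )
```

Because the weight is nonzero exactly when the decider accepts, the support of this weighted language is the chosen language, and the geometric bound in the comment keeps the normalizing constant finite.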

Figures (2)

  • Figure 1: Valid answers to hard natural language inference problems can be hard to find (xkcd), but in many cases can be checked efficiently (e.g. the Knapsack problem in the comic). Given a large enough parametric autoregressive model with correct parameters, we can efficiently solve all problem instances with input length $n$ and efficiently verify the solutions, but the required model size can grow superpolynomially in $n$. (This allows the model to store precomputed results that we can look up in $O(n)$ at test time.) A main observation of this paper is that, assuming $\mathrm{NP} \nsubseteq \mathrm{P/poly}$, without such superpolynomial growth in model size, autoregressive models cannot even be used to verify answers to some problems for which polynomial-time verification algorithms do exist.
  • Figure 2: The space of unweighted languages. We assume in this diagram that $\mathrm{NP}\nsubseteq\mathrm{P/poly}$. Each rectangular outline corresponds to a complexity class (named in its lower right corner) and encloses the languages whose decision problems fall into that class. Each bold-italic label (colored to match its shape outline) names a model family and encloses the languages that can be expressed as the support of some weighted language in that family. All induced partitions in the figure are non-empty sets: shape A properly encloses shape B if and only if language class A is a strict superset of language class B. As mentioned in \ref{fig:modelfamilycomparison}, standard autoregressive models (ELN models) have support languages that form a strict subset of $\mathrm{P}$ (\ref{thm:eccplimit}, \ref{thm:closure}, \ref{thm:in-ec-not-in-eln}, \ref{sec:p-poly}). ELNCP models (\ref{sec:local-normalization}) extend ELN models by allowing the parameter size to grow polynomially in string length. This lets them capture more languages inside $\mathrm{P}$ (\ref{thm:in-ec-in-elncp-not-in-eln}) as well as languages outside $\mathrm{P}$ (including undecidable but sparse languages) that can be characterized autoregressively with the help of these compact parameters. All of those languages belong to the class $\mathrm{P/poly}$. \ref{thm:nopforu} establishes that energy-based (EC) and ECCP models go strictly further than ELN and ELNCP models, respectively: they correspond to the entire classes $\mathrm{P}$ and $\mathrm{P/poly}$ (\ref{thm:eccplimit}). However, even ECCP does not capture any $\mathrm{NP}$-complete languages under our assumption $\mathrm{NP}\nsubseteq\mathrm{P/poly}$. Allowing a polynomial number of latent symbols extends the power further still: lightly marginalized ELNCP or ECCP distributions cover exactly the languages in $\mathrm{NP/poly}$ (\ref{thm:nppolyifflmeccp}). Finally, if we were to drop the requirement that the parameters $\boldsymbol{\Theta}$ must be compact, we could store lookup tries to model any weighted language (\ref{sec:semiparametric-models}).

Theorems & Definitions (39)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • proof : Proof sketch
  • Theorem 2
  • proof
  • Lemma 3
  • Theorem 3
  • Theorem 4
  • Theorem 7
  • ...and 29 more