Limitations of Autoregressive Models and Their Alternatives
Chu-Cheng Lin, Aaron Jaech, Xin Li, Matthew R. Gormley, Jason Eisner
TL;DR
Autoregressive models efficiently assign $p(\mathbf{x})=\prod_t p(x_t \mid \mathbf{x}_{<t})$ but cannot represent distributions whose next-symbol probabilities are hard to compute. The paper introduces a formal framework of efficiently computable weighted languages (EC) and their compact-parameter variant (ECCP), along with the autoregressive counterparts (ELN/ELNCP), linking them to complexity classes such as $\mathrm{P}$, $\mathrm{P/poly}$, and $\mathrm{NP/poly}$ via reductions from SAT. It shows that autoregressive (ELNCP) models have strictly less capacity than general EC/ECCP models: there exist EC distributions whose local conditionals no ELNCP model can compute efficiently, and ELNCP models cannot capture all supports or orderings of EC/ECCP distributions. The authors propose alternative families that can escape these limitations: energy-based models (EBMs), latent-variable autoregressive models, and lookup/semiparametric approaches, with residual EBMs demonstrating practical perplexity gains. Overall, the work clarifies fundamental tradeoffs between efficient scoring, parameter efficiency, and expressive power, guiding the design of future NLP models that balance computation with representational reach, and suggests directions for experimental validation and complexity-theoretic refinement.
Abstract
Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.
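The contrast between cheap global scoring and expensive next-symbol probabilities can be illustrated with a minimal sketch. It assumes a hypothetical toy weight function `weight` over fixed-length binary strings; the function, the clause it encodes, and all helper names are illustrative and do not come from the paper. Evaluating the unnormalized weight of a complete string is cheap, whereas the exact local conditional $p(x_t \mid \mathbf{x}_{<t})$ is a ratio of prefix sums that this brute-force version computes by enumerating every completion, which takes time exponential in the remaining length.

```python
from itertools import product


def weight(x):
    """Hypothetical unnormalized weight; cheap to evaluate on a full string x."""
    # Toy constraint (an assumption for illustration):
    # (x0 OR x1) AND (NOT x1 OR x2). Satisfying strings get a higher weight.
    clause1 = x[0] or x[1]
    clause2 = (not x[1]) or x[2]
    return 3.0 if (clause1 and clause2) else 1.0


def prefix_weight(prefix, n):
    """Total weight of all length-n strings starting with `prefix`.

    Brute force: enumerates 2 ** (n - len(prefix)) completions.
    """
    k = n - len(prefix)
    return sum(weight(tuple(prefix) + rest) for rest in product((0, 1), repeat=k))


def local_conditional(prefix, symbol, n):
    """Exact next-symbol probability as a ratio of two prefix sums."""
    return prefix_weight(tuple(prefix) + (symbol,), n) / prefix_weight(prefix, n)


def autoregressive_prob(x):
    """Chain-rule probability: product of local conditionals."""
    n = len(x)
    p = 1.0
    for t in range(n):
        p *= local_conditional(x[:t], x[t], n)
    return p


def globally_normalized_prob(x):
    """Energy-based view: weight of x divided by the partition function."""
    return weight(x) / prefix_weight((), len(x))


if __name__ == "__main__":
    for x in product((0, 1), repeat=3):
        print(x, autoregressive_prob(x), globally_normalized_prob(x))
```

Both routes define the same distribution, because the prefix sums telescope along the chain rule. The difference lies in where the cost falls: the globally normalized form needs only `weight(x)` to compare two strings, while the autoregressive form must obtain every local conditional, and a polynomial-time autoregressive model has no budget for the enumeration used here.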
