Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-dimensional Tokens
Vittorio Erba, Emanuele Troiani, Luca Biggio, Antoine Maillard, Lenka Zdeborová
TL;DR
This work introduces Bilinear Sequence Regression (BSR) as a tractable, analytically solvable model for learning from long sequences of high‑dimensional tokens, and frames a Bayes‑optimal learning problem in the high‑dimensional limit. It derives exact Bayes‑optimal generalization errors in the extensive‑width regime via a fixed‑point system for order parameters (e.g., $q$ and $\hat q$) and provides a Generalized AMP algorithm (GAMP‑RIE) that achieves these limits, highlighting the advantage of exploiting sequence/embedding structure over naive vectorization. The study further compares Bayes‑optimal performance to linear regression and minimal nuclear norm estimators, showing substantial gains when the width is finite and the prior is structured; it also investigates gradient descent dynamics, revealing that averaged GD can reach Bayes‑optimal performance in the noiseless setting, while noise introduces richer and more delicate behavior. By connecting BSR to transformer/Mixer backbones and outlining clear phase transitions (strong/weak recovery) controlled by width, aspect ratio, and sample complexity, the paper provides a principled theoretical foundation for learning from long sequences and informs the design of sequence architectures and training algorithms with attention to structural priors.
Abstract
Current progress in artificial intelligence is centered around so-called large language models that consist of neural networks processing long sequences of high-dimensional vectors called tokens. Statistical physics provides powerful tools to study the functioning of learning with neural networks and has played a recognized role in the development of modern machine learning. The statistical physics approach relies on simplified and analytically tractable models of data. However, simple tractable models for long sequences of high-dimensional tokens are largely underexplored. Inspired by the crucial role models such as the single-layer teacher-student perceptron (aka generalized linear regression) played in the theory of fully connected neural networks, in this paper, we introduce and study the bilinear sequence regression (BSR) as one of the most basic models for sequences of tokens. We note that modern architectures naturally subsume the BSR model due to the skip connections. Building on recent methodological progress, we compute the Bayes-optimal generalization error for the model in the limit of long sequences of high-dimensional tokens, and provide a message-passing algorithm that matches this performance. We quantify the improvement that optimal learning brings with respect to vectorizing the sequence of tokens and learning via simple linear regression. We also unveil surprising properties of the gradient descent algorithms in the BSR model.
