Bilinear Sequence Regression: A Model for Learning from Long Sequences of High-dimensional Tokens

Vittorio Erba, Emanuele Troiani, Luca Biggio, Antoine Maillard, Lenka Zdeborová

TL;DR

This work introduces Bilinear Sequence Regression (BSR) as a tractable, analytically solvable model for learning from long sequences of high‑dimensional tokens, and frames a Bayes‑optimal learning problem in the high‑dimensional limit. It derives exact Bayes‑optimal generalization errors in the extensive‑width regime via a fixed‑point system for order parameters (e.g., $q$ and $\hat q$) and provides a Generalized AMP algorithm (GAMP‑RIE) that achieves these limits, highlighting the advantage of exploiting sequence/embedding structure over naive vectorization. The study further compares Bayes‑optimal performance to linear regression and minimal nuclear norm estimators, showing substantial gains when the width is finite and the prior is structured; it also investigates gradient descent dynamics, revealing that averaged GD can reach Bayes‑optimal performance in the noiseless setting, while noise introduces richer and more delicate behavior. By connecting BSR to transformer/Mixer backbones and outlining clear phase transitions (strong/weak recovery) controlled by width, aspect ratio, and sample complexity, the paper provides a principled theoretical foundation for learning from long sequences and informs the design of sequence architectures and training algorithms with attention to structural priors.

Abstract

Current progress in artificial intelligence is centered around so-called large language models that consist of neural networks processing long sequences of high-dimensional vectors called tokens. Statistical physics provides powerful tools to study the functioning of learning with neural networks and has played a recognized role in the development of modern machine learning. The statistical physics approach relies on simplified and analytically tractable models of data. However, simple tractable models for long sequences of high-dimensional tokens are largely underexplored. Inspired by the crucial role models such as the single-layer teacher-student perceptron (aka generalized linear regression) played in the theory of fully connected neural networks, in this paper, we introduce and study the bilinear sequence regression (BSR) as one of the most basic models for sequences of tokens. We note that modern architectures naturally subsume the BSR model due to the skip connections. Building on recent methodological progress, we compute the Bayes-optimal generalization error for the model in the limit of long sequences of high-dimensional tokens, and provide a message-passing algorithm that matches this performance. We quantify the improvement that optimal learning brings with respect to vectorizing the sequence of tokens and learning via simple linear regression. We also unveil surprising properties of the gradient descent algorithms in the BSR model.
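
As a rough illustration of the setting described above, the sketch below generates synthetic data in which each label is a bilinear function of a sequence of $L$ tokens of dimension $d$, then fits the vectorization-plus-linear-regression baseline mentioned in the abstract. The exact form used here is an assumption, not taken from the text: a width-$r$ signal $S = UV^\top/\sqrt{r}$, Gaussian inputs, the $1/\sqrt{dL}$ normalisation, and the hypothetical helper name `sample_bsr`.

```python
import numpy as np

def sample_bsr(n, L, d, r, delta=0.0, seed=0):
    """Draw n (sequence, label) pairs from an assumed BSR-style teacher.

    Assumptions (not stated in the text): Gaussian factors and inputs,
    signal S = U V^T / sqrt(r), labels scaled by 1 / sqrt(d * L),
    additive Gaussian label noise of variance delta.
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((L, r))          # sequence-side factor
    V = rng.standard_normal((d, r))          # embedding-side factor
    S = U @ V.T / np.sqrt(r)                 # (L, d) signal of rank <= r
    X = rng.standard_normal((n, L, d))       # n sequences of L tokens in R^d
    y = np.einsum("nld,ld->n", X, S) / np.sqrt(L * d)
    y = y + np.sqrt(delta) * rng.standard_normal(n)
    return X, y, S

# Noiseless toy instance with more samples than entries of S (n > d * L).
X, y, S = sample_bsr(n=300, L=8, d=6, r=2)

# Baseline: flatten each sequence and solve ordinary least squares.
A = X.reshape(len(y), -1) / np.sqrt(S.size)
s_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
S_hat = s_hat.reshape(S.shape)
```

In this noiseless toy case with $n > dL$, plain least squares already recovers $S$, so the baseline ignores the low-width structure entirely; the regime of interest in the paper is the one where exploiting that structure lets an estimator succeed with fewer samples.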

Paper Structure

This paper contains 42 sections, 174 equations, 12 figures, 1 table, 1 algorithm.

Figures (12)

  • Figure 1: Bayes-optimal (BO) test error for the BSR model with noiseless output channel ($\Delta = 0$) as a function of the sample ratio $\alpha = n/(dL)$ (left column) and of the low-width sample ratio $\bar{\alpha} = n / [r(d+L)]$ (right column). Each row, from top to bottom, uses a different value of the aspect ratio $\beta = \max(d,L) / \min(d,L) = 1, 5, +\infty$, and each panel compares several values of the width ratio $\rho = r / \min(d,L) = 0.05, 0.1, 0.2, 0.5, 1, 2$ (colored solid lines). In the left column, the black dashed line shows for comparison the performance of optimally-regularised linear regression (in this case, $\lambda \to 0^+$) on the vectorized data; it does not depend on $\rho$ and $\beta$, and it also corresponds to the BO error for $\rho \to \infty$. We observe that the BO test error is always better than the linear regression test error, and that it improves as $\rho$ decreases: the more structure in the distribution of the signal, i.e. the lower the width, the better one can estimate it. We also observe that the BO test error vanishes at a finite value of $\alpha$, the so-called strong recovery threshold, and that this threshold is smaller than one for $\rho < 1$. In this regime, there are values of $\alpha$ for which the BO estimator achieves zero test error, while the linear regression estimator has a non-zero test error. The middle and bottom panels show the same overall phenomenology as $\beta$ increases from 1 to infinity. The right column shows the same curves as a function of the low-width sample ratio $\bar{\alpha} = n / [r(d+L)]$, compared with the already known low-width BO test error (solid black line, schulke16). We observe a clear convergence to the low-width error curve as $\rho \to 0$, but we highlight that, e.g. at $\beta = 1$, the test error of the BO estimator is still quantitatively better than its low-width counterpart already at $\rho = 0.05$. Notice also that for $\rho \to 0$ the BO estimator has a weak recovery threshold at $\bar{\alpha}_{\rm weak} = (1+\Delta)\sqrt{\beta}/(1+\beta)$, i.e. below this threshold it has the same performance as the zero estimator ${\hat{S}}_{\rm zero}(\mathcal{D}) = 0$. As soon as $\rho > 0$, the weak recovery threshold disappears, allowing for better-than-trivial performance at all values of $\bar{\alpha}$. The weak recovery threshold is marked by a vertical black marker: notice that for $\beta \to \infty$ the weak recovery threshold is at zero. The Bayes-optimal curves are plotted using Result \ref{prop:mmse_BSR} and \ref{eqres1} for extensive width and Previous result \ref{prevreslow} for intensive width. Linear regression is plotted using Previous result \ref{prevres-regr}.
  • Figure 2: Bayes-optimal (BO) test error for the BSR model with noiseless output channel ($\Delta = 0$) as a function of the sample ratio $\alpha = n/(dL)$. Each panel uses a different value of the width ratio $\rho = r / \min(d,L) = 0.05, 0.5, 1, 2$, and all panels show several values of the aspect ratio $\beta = \max(d,L) / \min(d,L) = 1, 5, +\infty$ (colored solid lines). The black dashed line is the performance of optimally-regularised linear regression on the vectorized data. Again we observe that the more structured the signal (larger $\beta$ and smaller $\rho$), the lower the achieved test error. The Bayes-optimal curves are plotted using Result \ref{prop:mmse_BSR} and \ref{eqres1}. Linear regression is plotted using Previous result \ref{prevres-regr}.
  • Figure 3: Comparison between the BO test error and the test error of GAMP-RIE (Algorithm \ref{alg-main}) for two choices of the aspect ratio $\beta = \max(d,L) / \min(d,L) = 1, 2$ and of the width ratio $\rho = r / \min(d,L) = 0.2, 1$, in both the noiseless ($\Delta = 0$, left) and noisy ($\Delta = 0.1$, right) cases. Solid lines are the theoretical prediction from \ref{eqres1}. The crosses represent numerical experiments for the test error measured after iterating GAMP-RIE until convergence, on instances of size $\min(d,L) = 100$, with initialisation from the prior distribution. Each point is a run over a single realisation of the data and ground truth. The Bayes-optimal curves are plotted using Result \ref{prop:mmse_BSR} and \ref{eqres1}. The performance of GAMP-RIE is given by \ref{eq:est} applied to the output of Algorithm \ref{alg-main}.
  • Figure 4: Comparison between the BO strong recovery threshold (Result \ref{res:strong}, solid lines) and the MNNE strong recovery threshold (Previous Result \ref{mnne_res}, dashed lines) for $\beta = \max(d,L) / \min(d,L) = 1, 2, 10$ as a function of the width ratio $\rho = r / \min(d,L)$. In the left panel we plot the strong recovery thresholds in the scaling $\alpha = n/(dL)$, natural in the extensive-width case $\rho >0$. In the right panel we plot the same data in the low-width sample scaling $\bar{\alpha} = n / [r(d+L)]$, highlighting the strong suboptimality of MNNE at low ranks/widths. The colored markers on the vertical axis highlight the finite $\rho \to 0$ limit of the strong recovery threshold of MNNE, as given in \ref{eq-mnne-strong-low}.
  • Figure 5: Comparison between the test error achieved by GD and averaged GD (AGD), both initialised from the prior, and the BO test error for $\beta = 1$, $\rho = 0.2$ and $\Delta = 0, 0.5$ (left and right panels respectively). In the noisy case, we depict the test error minus the variance of the noise $\Delta$. Solid lines are the BO test error; dashed lines are twice the BO test error, corresponding to the error of the Gibbs sampler. Orange crosses are numerical experiments for the test errors at the end of the run of GD for $d = L = 100$ with maximum number of steps $\tau = 50000$; runs are averaged over 16 instances of the data. Blue squares are numerical experiments for the test errors at the end of the run of AGD (averaged over 32 initial conditions), and they are averaged over 2 instances of the data (8 in the right panel up to $\alpha = 0.6$). The error bars, denoting the standard error of the mean, are negligible. In both cases, a fine-tuned value of the learning rate $\eta(\alpha)$, dependent on the sample ratio $\alpha$, must be used. We provide the values used to generate this plot in Appendix \ref{app.additional}. The green crosses mark the value of the training loss at the end of the training for GD. The grey vertical line in the right-hand panel marks where the number of samples equals the number of degrees of freedom. Finally, red dots are numerical experiments for GAMP-RIE, with a single random instance of $d = L = 100$. We observe that in the noiseless case $\Delta = 0$ (left), GD achieves a test error compatible with the error of the Gibbs sampler and AGD achieves a test error compatible with the BO test error. Instead, for $\Delta = 0.5$ (right) we observe that AGD does not reach the BO error, and moreover it trivialises (namely, all differently-initialised runs of GD converge to the same estimator) for $\alpha$ large enough, roughly around $\alpha \approx 1$ here. We show qualitatively similar comparisons at $\beta = 2$ in Appendix \ref{app.additional}. The Bayes-optimal curves are plotted using Result \ref{prop:mmse_BSR} and \ref{eqres1}. The performance of GAMP-RIE, GD and AGD is given by \ref{eq:gen} applied to the output of the respective algorithm.
  • ...and 7 more figures
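
The captions above use four dimensionless ratios ($\alpha$, $\bar{\alpha}$, $\beta$, $\rho$) and quote a weak recovery threshold for $\rho \to 0$. The following sketch simply evaluates those formulas as written in the captions; the function names `bsr_ratios` and `weak_recovery_threshold` and the example sizes are hypothetical and chosen only for illustration.

```python
import math

def bsr_ratios(n, d, L, r):
    """Dimensionless ratios as defined in the figure captions."""
    small, large = min(d, L), max(d, L)
    return {
        "alpha": n / (d * L),            # sample ratio
        "alpha_bar": n / (r * (d + L)),  # low-width sample ratio
        "beta": large / small,           # aspect ratio
        "rho": r / small,                # width ratio
    }

def weak_recovery_threshold(beta, delta=0.0):
    """alpha_bar_weak = (1 + Delta) * sqrt(beta) / (1 + beta), quoted for rho -> 0."""
    return (1.0 + delta) * math.sqrt(beta) / (1.0 + beta)

# Illustrative sizes (not from the paper): square tokens-by-embedding shape.
ratios = bsr_ratios(n=4000, d=100, L=100, r=20)
threshold = weak_recovery_threshold(beta=ratios["beta"], delta=0.0)
```

With these sizes, $\alpha = 0.4$, $\bar{\alpha} = 1$, $\beta = 1$ and $\rho = 0.2$, and the $\rho \to 0$ weak recovery threshold at $\beta = 1$, $\Delta = 0$ evaluates to $0.5$. The same function also shows the threshold shrinking toward zero as $\beta$ grows, consistent with the remark in the Figure 1 caption.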