Ordinary Least Squares as an Attention Mechanism

Philippe Goulet Coulombe

TL;DR

This paper reframes ordinary least squares (OLS) as a restricted attention mechanism, showing that OLS predictions can be obtained from inner-product similarities in a transformed, orthonormal feature space. Using the eigendecomposition $\boldsymbol{X}_{\text{train}}'\boldsymbol{X}_{\text{train}} = \boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}'$ and defining $\boldsymbol{F}_{\text{train}} = \boldsymbol{X}_{\text{train}} \boldsymbol{U}\boldsymbol{\Lambda}^{-1/2}$ and $\boldsymbol{F}_{\text{test}} = \boldsymbol{X}_{\text{test}} \boldsymbol{U}\boldsymbol{\Lambda}^{-1/2}$, predictions reduce to $\hat{\boldsymbol{y}}_{\text{test}} = \boldsymbol{F}_{\text{test}}\boldsymbol{F}_{\text{train}}' \boldsymbol{y}_{\text{train}}$, i.e., a proximity-based weighting akin to attention weights. The paper then extends this view to nonlinear attention via softmax, to dimensionality reduction (principal components regression), and to ridge regularization, and connects it with core Transformer elements such as multi-head and self-attention, masking, and positional encoding. It provides experimental evidence that a nonlinear attention regression framework can achieve competitive out-of-sample performance relative to standard nonlinear models across diverse data-generating processes, highlighting the practical relevance of viewing attention as a regression-style, similarity-based embedding problem. Overall, the work bridges statistics and modern NLP architectures, offering a lens to interpret and potentially improve attention-based models using traditional regression insights.
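
The reduction stated above can be checked in a few steps. The following is a sketch that assumes $\boldsymbol{X}_{\text{train}}'\boldsymbol{X}_{\text{train}}$ has full rank (so $\boldsymbol{\Lambda}$ is invertible) and uses only $\boldsymbol{U}'\boldsymbol{U} = \boldsymbol{I}$ and the symmetry of the diagonal matrix $\boldsymbol{\Lambda}^{-1/2}$:

$$
\begin{aligned}
\hat{\boldsymbol{y}}_{\text{test}}
&= \boldsymbol{X}_{\text{test}} \left(\boldsymbol{X}_{\text{train}}'\boldsymbol{X}_{\text{train}}\right)^{-1} \boldsymbol{X}_{\text{train}}' \boldsymbol{y}_{\text{train}} \\
&= \boldsymbol{X}_{\text{test}} \boldsymbol{U}\boldsymbol{\Lambda}^{-1}\boldsymbol{U}' \boldsymbol{X}_{\text{train}}' \boldsymbol{y}_{\text{train}} \\
&= \left(\boldsymbol{X}_{\text{test}} \boldsymbol{U}\boldsymbol{\Lambda}^{-1/2}\right)\left(\boldsymbol{X}_{\text{train}} \boldsymbol{U}\boldsymbol{\Lambda}^{-1/2}\right)' \boldsymbol{y}_{\text{train}} \\
&= \boldsymbol{F}_{\text{test}}\boldsymbol{F}_{\text{train}}' \boldsymbol{y}_{\text{train}}.
\end{aligned}
$$

The same ingredients give $\boldsymbol{F}_{\text{train}}'\boldsymbol{F}_{\text{train}} = \boldsymbol{\Lambda}^{-1/2}\boldsymbol{U}'\boldsymbol{U}\boldsymbol{\Lambda}\boldsymbol{U}'\boldsymbol{U}\boldsymbol{\Lambda}^{-1/2} = \boldsymbol{I}$, which is the sense in which the transformed feature space is orthonormal.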

Abstract

I show that ordinary least squares (OLS) predictions can be rewritten as the output of a restricted attention module, akin to those forming the backbone of large language models. This connection offers an alternative perspective on attention beyond the conventional information retrieval framework, making it more accessible to researchers and analysts with a background in traditional statistics. It falls into place when OLS is framed as a similarity-based method in a transformed regressor space, distinct from the standard view based on partial correlations. In fact, the OLS solution can be recast as the outcome of an alternative problem: minimizing squared prediction errors by optimizing the embedding space in which training and test vectors are compared via inner products. Rather than estimating coefficients directly, we equivalently learn optimal encoding and decoding operations for predictors. From this vantage point, OLS maps naturally onto the query-key-value structure of attention mechanisms. Building on this foundation, I discuss key elements of Transformer-style attention and draw connections to classic ideas from time series econometrics.
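
The query-key-value reading can be illustrated numerically. The snippet below is a minimal sketch, not the paper's code: the simulated data, the dimensions, and all variable names are assumptions made for illustration. It treats the transformed test rows as queries, the transformed training rows as keys, and the training targets as values, then checks that the resulting linear attention output matches standard OLS predictions.

```python
import numpy as np

# Simulated data (illustrative assumption, not from the paper).
rng = np.random.default_rng(0)
n_train, n_test, p = 200, 50, 5
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
beta = rng.normal(size=p)
y_train = X_train @ beta + 0.1 * rng.normal(size=n_train)

# Standard OLS: estimate coefficients, then predict.
beta_hat = np.linalg.solve(X_train.T @ X_train, X_train.T @ y_train)
y_ols = X_test @ beta_hat

# Eigendecomposition X'X = U diag(lam) U' (eigh handles symmetric matrices).
lam, U = np.linalg.eigh(X_train.T @ X_train)

# Shared encoder W = U diag(lam)^(-1/2); broadcasting scales column j by lam_j^(-1/2).
W = U / np.sqrt(lam)

keys = X_train @ W      # F_train, shape (n_train, p)
queries = X_test @ W    # F_test, shape (n_test, p)
values = y_train        # training targets

# Linear "attention weights": inner products between queries and keys, no softmax.
A = queries @ keys.T    # shape (n_test, n_train)
y_att = A @ values

print(np.allclose(y_ols, y_att))               # predictions agree up to rounding
print(np.allclose(keys.T @ keys, np.eye(p)))   # keys are orthonormal in feature space
```

Each row of `A` holds the weights a single test observation places on every training observation. Unlike softmax attention weights, these can be negative and are not normalized across the training observations.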

Paper Structure

This paper contains 37 sections, 43 equations, 2 figures.

Figures (2)

  • Figure 1: OLS Predictions as a Linear Attention Module
  • Figure 2: Model Performance Comparison Across Data Generating Processes