Exponential Family Attention

Kevin Christian Wibisono, Yixin Wang

TL;DR

Exponential Family Attention (EFA) generalizes self-attention from language to high-dimensional, mixed-type data by coupling attention-derived context with exponential-family conditionals. The approach unifies latent-factor models as special cases while enabling nonlinear, context-dependent interactions and learned context sets through attention. Theoretical contributions include linear identifiability and an excess loss generalization bound, complemented by strong empirical performance on synthetic data, Instacart baskets, MovieLens, and spatiotemporal temperatures. The results suggest EFA’s broad applicability for modeling complex dependencies in non-text domains and its potential to improve predictive reconstructions and recommendations in real-world settings.

Abstract

The self-attention mechanism is the backbone of the transformer neural network underlying most large language models. It can capture complex word patterns and long-range dependencies in natural language. This paper introduces exponential family attention (EFA), a probabilistic generative model that extends self-attention to handle high-dimensional sequence, spatial, or spatial-temporal data of mixed data types, including both discrete and continuous observations. The key idea of EFA is to model each observation conditional on all other existing observations, called the context, whose relevance is learned in a data-driven way via an attention-based latent factor model. In particular, unlike static latent embeddings, EFA uses the self-attention mechanism to capture dynamic interactions in the context, where the relevance of each context observation depends on other observations. We establish an identifiability result and provide a generalization guarantee on excess loss for EFA. Across real-world and synthetic data sets -- including U.S. city temperatures, Instacart shopping baskets, and MovieLens ratings -- we find that EFA consistently outperforms existing models in capturing complex latent structures and reconstructing held-out data.
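
To make the key idea concrete, the following is a minimal sketch, not the paper's implementation, of how an attention-weighted context could feed an exponential-family conditional. It assumes a single attention head, a Poisson likelihood for count data such as basket quantities, and a natural parameter formed as the dot product of a center embedding with the context vector. All names and toy values (`attention_context`, `center_embedding`, `query`, the embeddings) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_context(query, keys, values):
    """Single-head scaled dot-product attention over the context items.

    Returns the attention-weighted context vector and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


def poisson_log_lik(count, natural_param):
    """Poisson log-likelihood in natural form: x*eta - exp(eta) - log(x!)."""
    return count * natural_param - math.exp(natural_param) - math.lgamma(count + 1)


# Hypothetical toy values (assumptions, not taken from the paper).
context_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
center_embedding = [0.5, -0.5]

# Relevance of each context item is computed by attention, not fixed in advance.
context, weights = attention_context(query, context_embeddings, context_embeddings)

# Assumed link: the natural parameter is the center embedding dotted with the context.
eta = dot(center_embedding, context)
log_lik = poisson_log_lik(2, eta)
```

Swapping `poisson_log_lik` for a Gaussian or categorical log-likelihood would, under the same assumptions, adapt the sketch to continuous or discrete observations; only the conditional distribution changes, while the attention step stays the same.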

Paper Structure

This paper contains 39 sections, 5 theorems, 56 equations, 10 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Under a specific set of parameters, the bidirectional EFA model in Sections sec:self-attn-lang and sec:app-baskets reduces to a variant of linear latent factor models: where $\sigma$ denotes the softmax operator.

Figures (10)

  • Figure 1: Examples of data that can be effectively modeled by exponential family attention (EFA) include: (left) spatiotemporal data, where each time series is linked to a specific attribute (e.g., spatial location); (center) sequential grocery purchase data, capturing the sequence of purchased items and their quantities for each user; and (right) sequential movie rating data, detailing the order of movies rated by each user along with their corresponding ratings.
  • Figure 2: An end-to-end illustration of the exponential family attention (EFA) model. The middle and right panels correspond to the first and second terms of Equation \ref{eq:decomp}, respectively. Here, $\sigma(\cdot)$ and $\textrm{Cat}(\cdot)$ refer to the softmax operation and categorical distribution, respectively.
  • Figure 3: 1 basket, 1 layer
  • Figure 4: 1 basket, 2 layer
  • Figure 5: 2 basket, 1 layer
  • ...and 5 more figures

Theorems & Definitions (26)

  • Remark 1: Self-attention and latent factor modeling.
  • Remark 2: Extensions to multi-head multi-layer self-attention model
  • Remark 3: Causal mask ensures unidirectionality
  • Remark 4: Extension to bidirectional modeling
  • Example 1: Market baskets; chen2020studying
  • Remark 5: Uni-/Bi-directional modeling
  • Remark 6: Choices of embedding and unembedding functions
  • Remark 7: Center and context embeddings
  • Remark 8: Extensions to multi-head multi-layer exponential family attention
  • Example 2: Gaussian spatiotemporal time series; rudolph2016exponential
  • ...and 16 more