Attention as an Adaptive Filter

Peter Racioppo

TL;DR

Adaptive Filter Attention (AFA) reframes attention as a parallelized, Kalman-filter-like estimation over a linear stochastic dynamics model. By modeling tokens as observations from a linear Itô SDE with diagonalizable dynamics, AFA propagates uncertainty via a closed-form solution of the differential Lyapunov equation and derives attention weights from robust, residual-based reweightings of the propagated query-key precisions. The framework yields a tensor-form attention (and an isotropic simplification, Isotropic AFA) with time and memory complexity comparable to standard attention, and recovers a complex-valued form of attention with rotary positional encodings in the limit where decay and process noise vanish. A radial-tangential SDE (RT-SDE) variant further allows structured anisotropic noise with closed-form propagation, enabling a maximum-likelihood interpretation of Transformer-like computation on a hypersphere. Overall, AFA unifies attention with adaptive filtering, adding temporal structure, uncertainty propagation, and a principled MLE-based derivation of attention, while remaining practically scalable under several simplifying assumptions. The work connects to rotary embeddings, complex-valued networks, and structured state-space methods, offering a path toward dynamics-aware, uncertainty-robust attention mechanisms for long-context modeling.
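To make the closed-form uncertainty propagation concrete, here is a minimal sketch for a single real diagonal mode. It assumes a scalar SDE dx = lam * x dt + sigma dW, whose variance obeys the differential Lyapunov equation dP/dt = 2 * lam * P + sigma^2. The function names, the scalar simplification, and the parameter values are illustrative assumptions, not the paper's notation or implementation.

```python
import math

def propagate_variance(p0, lam, sigma2, dt):
    """Closed-form solution of dP/dt = 2*lam*P + sigma2 after an interval dt.

    p0: variance at the start of the interval
    lam: real eigenvalue of the (assumed scalar) dynamics; negative means decay
    sigma2: process noise intensity
    """
    if abs(lam) < 1e-12:
        # No decay: variance grows linearly, as for a random walk.
        return p0 + sigma2 * dt
    growth = math.exp(2.0 * lam * dt)
    return growth * p0 + sigma2 * (growth - 1.0) / (2.0 * lam)

def euler_variance(p0, lam, sigma2, dt, steps=20000):
    """Forward-Euler integration of the same ODE, used only as a cross-check."""
    h = dt / steps
    p = p0
    for _ in range(steps):
        p += h * (2.0 * lam * p + sigma2)
    return p

closed = propagate_variance(1.0, -0.5, 0.3, 2.0)
numeric = euler_variance(1.0, -0.5, 0.3, 2.0)
precision = 1.0 / closed  # precision is the reciprocal of the propagated variance
print(closed, numeric, precision)
```

Because the solution is available in closed form, the variance for any time gap can be evaluated directly, without stepping through intermediate times. That is the property that makes a parallel, attention-style computation over all query-key pairs plausible.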

Abstract

We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By assuming a continuous-time linear time-invariant system with simultaneously-diagonalizable state matrices and noise covariances, we can make use of a closed-form solution of the differential Lyapunov equation to efficiently propagate uncertainties through the dynamics from keys to queries. A generalization of attention naturally arises as the maximum likelihood solution for filtering the trajectory of this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated query-key precisions. We further constrain the system dynamics and noise in order to obtain a simplified variant with the same computational and memory complexity as standard attention. In the limit of zero decay and process noise, and using a small-angle approximation, we recover a complex-valued generalization of ordinary dot-product attention with rotary positional encodings.
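The rotary limit mentioned at the end of the abstract can be illustrated with a small sketch. It assumes noise-free, decay-free dynamics in which each complex coordinate simply rotates as x(t) = exp(i * omega * t) * x(0). Under that assumption, comparing a query with a key that has been propagated forward by their time offset gives the same score as first rotating both vectors by their absolute positions, which is the relative-position property of rotary encodings. The helper names and example values are hypothetical, and the sketch does not reproduce the paper's small-angle approximation.

```python
import cmath

def pull_back(vec, freqs, position):
    """Undo the rotation exp(i*omega*position), mapping a vector back to time 0."""
    return [v * cmath.exp(-1j * w * position) for v, w in zip(vec, freqs)]

def score(q, k):
    """Real part of the Hermitian inner product between q and k."""
    return sum((qi.conjugate() * ki).real for qi, ki in zip(q, k))

def propagated_score(q, k, freqs, offset):
    """Score after moving the key forward in time by `offset` under pure rotation."""
    moved = [ki * cmath.exp(1j * w * offset) for ki, w in zip(k, freqs)]
    return score(q, moved)

q = [1.0 + 2.0j, -0.5 + 0.3j]   # illustrative query
k = [0.2 - 1.0j, 1.5 + 0.5j]    # illustrative key
freqs = [1.0, 0.1]              # illustrative rotation frequencies

# Query at position 7, key at position 3: both views give the same score.
absolute = score(pull_back(q, freqs, 7), pull_back(k, freqs, 3))
relative = propagated_score(q, k, freqs, 7 - 3)
print(absolute, relative)
```

Shifting both positions by the same amount leaves the score unchanged, since only the offset between query and key enters the computation.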

Paper Structure

This paper contains 59 sections, 427 equations, 3 figures, 4 algorithms.

Figures (3)

  • Figure 1: Filter performance on different 2D systems: ground-truth trajectory (black), measured (blue), and predicted (red). (a) system with only measurement noise ($\sigma^2 = 0.0, \, \eta^2 = 1.0$). (b) system with both process noise and measurement noise ($\sigma^2 = 0.3, \, \eta^2 = 0.5$). (c) higher noise ($\sigma^2 = 0.5, \, \eta^2 = 2.0$).
  • Figure 2: Comparison of an AFA layer's "pulled-forward" state estimates at different stages of training. The true trajectory is shown as a solid black line, and the pulled-forward estimates as colored point clouds. (a) State estimates early in training. (b) State estimates midway through training. (c) State estimates after training is complete.
  • Figure 3: Attention matrices produced by training standard attention and AFA on a 2D LTI system with process and measurement noise. (a) First layer of standard attention. (b) Second layer of standard attention. (c) Single layer of AFA.
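The kind of data described in the figure captions, a 2D linear time-invariant system observed through noise, can be generated with a short simulation. The sketch below assumes a damped rotation as the dynamics, isotropic process noise of intensity sigma2, and isotropic measurement noise of variance eta2. The specific dynamics, the function name, and all parameter values other than the caption's noise levels are assumptions made for illustration, not the systems used in the paper.

```python
import math
import random

def simulate_lti_2d(n_steps, dt, decay, omega, sigma2, eta2, x0=(1.0, 0.0), seed=0):
    """Simulate a damped 2D rotation with process and measurement noise.

    Returns (states, measurements), each a list of (x, y) tuples.
    """
    rng = random.Random(seed)
    shrink = math.exp(-decay * dt)
    c, s = math.cos(omega * dt), math.sin(omega * dt)
    # Exact per-step process noise variance for isotropic noise and a
    # damped rotation (closed-form Lyapunov solution over one step).
    if decay > 0.0:
        q = sigma2 * (1.0 - math.exp(-2.0 * decay * dt)) / (2.0 * decay)
    else:
        q = sigma2 * dt
    q_std, r_std = math.sqrt(q), math.sqrt(eta2)

    x, y = x0
    states, measurements = [], []
    for _ in range(n_steps):
        # Deterministic part: rotate by omega*dt and shrink by exp(-decay*dt).
        x, y = shrink * (c * x - s * y), shrink * (s * x + c * y)
        # Process noise perturbs the true state.
        x += q_std * rng.gauss(0.0, 1.0)
        y += q_std * rng.gauss(0.0, 1.0)
        states.append((x, y))
        # Measurement noise perturbs only what is observed.
        measurements.append((x + r_std * rng.gauss(0.0, 1.0),
                             y + r_std * rng.gauss(0.0, 1.0)))
    return states, measurements

# Noise levels taken from the Figure 1(a) caption: measurement noise only.
states, measurements = simulate_lti_2d(
    n_steps=200, dt=0.05, decay=0.1, omega=1.0, sigma2=0.0, eta2=1.0)
print(len(states), len(measurements))
```

With sigma2 set to zero the true trajectory is a deterministic spiral and all scatter comes from the measurements. Raising sigma2 makes the trajectory itself wander, which is the harder filtering setting in panels (b) and (c) of Figure 1.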