Attention as an Adaptive Filter
Peter Racioppo
TL;DR
Adaptive Filter Attention (AFA) reframes attention as a parallelized, Kalman-filter-like estimation over a linear stochastic dynamics model. By modeling tokens as observations of a linear Itô SDE with diagonalizable dynamics, AFA propagates uncertainty via a closed-form solution of the differential Lyapunov equation and derives attention weights from robust, residual-based reweightings of the propagated query-key precisions. The framework yields a tensor-form attention (and an isotropic simplification, Isotropic AFA) with time and memory complexity comparable to standard attention, and it recovers a complex-valued form of attention with rotary positional encodings in the limit where decay and process noise vanish. A radial-tangential SDE (RT-SDE) variant further allows structured anisotropic noise with closed-form propagation, enabling a maximum-likelihood interpretation of Transformer-like computation on a hypersphere. Overall, AFA unifies attention with adaptive filtering: it adds temporal structure, uncertainty propagation, and a principled maximum-likelihood (MLE) basis for attention weights, while remaining practically scalable under several simplifying assumptions. The work connects to rotary embeddings, complex-valued networks, and structured state-space methods, offering a path toward dynamics-aware, uncertainty-robust attention mechanisms for long-context modeling.
Abstract
We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By assuming a continuous-time linear time-invariant system with simultaneously diagonalizable state matrices and noise covariances, we can use a closed-form solution of the differential Lyapunov equation to efficiently propagate uncertainties through the dynamics from keys to queries. A generalization of attention naturally arises as the maximum likelihood solution for filtering the trajectory of this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated query-key precisions. We further constrain the system dynamics and noise to obtain a simplified variant with the same computational and memory complexity as standard attention. In the limit of zero decay and process noise, and using a small-angle approximation, we recover a complex-valued generalization of ordinary dot-product attention with rotary positional encodings.
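As a rough illustration of the closed-form uncertainty propagation mentioned above, the following is a minimal sketch for a single diagonal mode, not the paper's tensor-form implementation. It assumes a scalar mode with complex eigenvalue lam and process-noise intensity sigma2, so that the variance obeys dP/dt = 2 Re(lam) P + sigma2. The names propagate_variance, propagate_variance_euler, p0, lam, sigma2 and dt are hypothetical and chosen only for this example.

```python
import numpy as np


def propagate_variance(p0, lam, sigma2, dt):
    """Closed-form solution of dP/dt = 2*Re(lam)*P + sigma2 after time dt.

    Illustrative scalar (single diagonal mode) case only.
    """
    a = 2.0 * float(np.real(lam))
    growth = np.exp(a * dt)
    # integral_0^dt exp(a*s) ds = (exp(a*dt) - 1) / a, which tends to dt as a -> 0
    if abs(a) < 1e-12:
        integral = dt
    else:
        integral = np.expm1(a * dt) / a
    return growth * p0 + sigma2 * integral


def propagate_variance_euler(p0, lam, sigma2, dt, steps=20000):
    """Forward-Euler integration of the same ODE, used only as a sanity check."""
    a = 2.0 * float(np.real(lam))
    h = dt / steps
    p = p0
    for _ in range(steps):
        p = p + h * (a * p + sigma2)
    return p


if __name__ == "__main__":
    lam = -0.5 + 2.0j  # decaying, rotating mode (assumed values)
    closed = propagate_variance(1.0, lam, 0.3, 1.5)
    numeric = propagate_variance_euler(1.0, lam, 0.3, 1.5)
    print(closed, numeric)
```

Only the real part of lam affects the variance; the imaginary part contributes a pure rotation of the state. With zero decay and zero process noise (for example lam = 1j, sigma2 = 0) the variance stays constant, which is consistent with the rotary limit described in the abstract.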
