
Free Energy Mixer

Jiecheng Lu, Shihao Yang

TL;DR

The paper targets a limitation of standard attention: it stores the full KV cache losslessly but reads it through a single per-head convex combination, which hinders per-channel index selection. It introduces the Free Energy Mixer (FEM), a variational read that, for each channel, optimizes a posterior over past indices using a value-driven tilt: $F_{t,j}(\beta) = \frac{1}{\beta}\log\sum_{i\in M_t} p_t(i)\exp(\beta v_{i,j})$, with posterior $q^{(\beta)}_{t,j}(i) \propto p_t(i)\exp(\beta v_{i,j})$. FEM preserves the base complexity and interpolates smoothly from averaging to near-hard indexing via a learnable per-channel temperature, implemented through a two-level gating scheme (inner temperature $\lambda_t$ and outer gate $g_t$) with optional low-rank local conditioning. The approach is compatible with various priors (softmax attention, linear attention, RNNs, SSMs) and acts as a universal fast-weight programmer by enabling channel-wise, value-aware cross-token competition without altering asymptotic costs. Empirically, FEM variants outperform strong baselines across NLP, vision, and time-series tasks at matched parameter budgets, with the main gains attributed to the LSE and temperature components; the method also shows favorable latency and memory characteristics in practical settings.

Abstract

Standard attention stores keys/values losslessly but reads them via a per-head convex average, blocking channel-wise selection. We propose the Free Energy Mixer (FEM): a free-energy (log-sum-exp) read that applies a value-driven, per-channel log-linear tilt to a fast prior (e.g., from queries/keys in standard attention) over indices. Unlike methods that attempt to improve and enrich the $(q,k)$ scoring distribution, FEM treats it as a prior and yields a value-aware posterior read at unchanged complexity, smoothly moving from averaging to per-channel selection as the learnable inverse temperature increases, while still preserving parallelism and the original asymptotic complexity ($O(T^2)$ for softmax; $O(T)$ for linearizable variants). We instantiate a two-level gated FEM that is plug-and-play with standard and linear attention, linear RNNs and SSMs. It consistently outperforms strong baselines on NLP, vision, and time-series at matched parameter budgets.
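
To make the free-energy read concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It evaluates $F_{t,j}(\beta)$ and the tilted posterior for a single time step, assuming a hand-picked prior over three indices and two value channels. The function names `free_energy_read` and `tilted_posterior`, and the toy numbers, are illustrative assumptions.

```python
import math


def free_energy_read(prior, values, beta):
    """Per-channel read F_j(beta) = (1/beta) * log sum_i p(i) * exp(beta * v[i][j]).

    prior  : list of T non-negative weights summing to 1 (the fast prior p_t)
    values : T x D nested list of value vectors
    beta   : inverse temperature; beta == 0 is treated as its limit, the prior mean
    """
    num_channels = len(values[0])
    out = []
    for j in range(num_channels):
        column = [v[j] for v in values]
        if beta == 0.0:
            out.append(sum(p * x for p, x in zip(prior, column)))
            continue
        # Stabilised log-sum-exp: shift by the largest exponent with prior mass.
        shift = max(beta * x for p, x in zip(prior, column) if p > 0)
        total = sum(p * math.exp(beta * x - shift)
                    for p, x in zip(prior, column) if p > 0)
        out.append((shift + math.log(total)) / beta)
    return out


def tilted_posterior(prior, values, beta, channel):
    """Posterior q(i) proportional to p(i) * exp(beta * v[i][channel])."""
    logits = [beta * v[channel] for v in values]
    shift = max(logits)
    weights = [p * math.exp(l - shift) for p, l in zip(prior, logits)]
    z = sum(weights)
    return [w / z for w in weights]


# Toy data (assumed): channel 0 peaks at index 0, channel 1 peaks at index 1.
prior = [0.5, 0.3, 0.2]
values = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

mean_read = free_energy_read(prior, values, 0.0)   # plain convex average
soft_read = free_energy_read(prior, values, 1.0)   # mild value-driven tilt
hard_read = free_energy_read(prior, values, 50.0)  # close to per-channel max
```

In this toy setting the read starts at the prior mean (0.6, 0.7) for $\beta = 0$ and approaches the per-channel maxima (1.0, 2.0) as $\beta$ grows, while the posteriors for the two channels concentrate on different indices. A single convex combination shared across channels cannot produce this behavior.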

Paper Structure

This paper contains 121 sections, 28 theorems, 94 equations, 3 figures, and 7 tables.

Key Result

Lemma 2.2

Let $\bm m_t=(\max_{i\le t}v_{i,1},\dots,\max_{i\le t}v_{i,D})$. If $\bm m_t\in\mathrm{conv}\{\bm v_1,\dots,\bm v_t\}$, then a single index simultaneously attains all coordinate maxima. Hence if the arg-max indices differ across coordinates, $\bm m_t\notin\mathrm{conv}\{\bm v_1,\dots,\bm v_t\}$.
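
One way to see why the lemma holds, given here as a short sketch rather than the paper's own proof: suppose $\bm m_t=\sum_{i\le t}\alpha_i\bm v_i$ with $\alpha_i\ge 0$ and $\sum_{i\le t}\alpha_i=1$ (the weights $\alpha_i$ and the notation $m_{t,j}$ for the $j$-th coordinate of $\bm m_t$ are introduced only for this sketch). Then for each coordinate $j$,

$$
m_{t,j} \;=\; \sum_{i\le t}\alpha_i\, v_{i,j} \;\le\; \sum_{i\le t}\alpha_i \max_{k\le t} v_{k,j} \;=\; m_{t,j},
$$

so equality holds throughout, which forces $v_{i,j}=m_{t,j}$ for every $i$ with $\alpha_i>0$. Any such index therefore attains the maximum in every coordinate at once. As a concrete instance, $\bm v_1=(1,0)$ and $\bm v_2=(0,1)$ give $\bm m_t=(1,1)$, yet every convex combination $(\alpha,1-\alpha)$ has coordinates summing to $1$, so $(1,1)$ lies outside the convex hull.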

Figures (3)

  • Figure 1: (a) Classic attention stores past values losslessly but reads them as a single convex combination, so channel-wise indexing (e.g., per-channel argmax) is not representable. (b) Free Energy Mixer (FEM) treats selection as a DV free-energy problem: values tilt the prior to a value-aware posterior with a learnable per-channel temperature, enabling low-entropy (point-like) posteriors and channel-wise selection while preserving the prior’s time complexity. (c) Common fixes (more heads, deeper stacks, separable mixers, and per-channel scoring) either keep channels synchronized or raise cost / rely on fixed-state storage; none close the lossy-memory gap that FEM addresses.
  • Figure 2: Overview of the Two-Level Gated Free Energy Mixer. (a) Lightweight linear & low-rank local convolution for local conditioning. (b) Prior selection: softmax attention uses a probability normalizer, while linear RNN/SSM use an operator-induced normalizer. (c) FEM integrated into a Pre-Norm Transformer block. (d) Final architecture: compute mean $\mu_t$ and max-temperature branch $F_t^{\max}$, with inner gate $\lambda_t$ interpolating and outer gate $g_t$ scaling. (e) Free-energy curve: improvement over $\mu_t$ equals $\mathrm{KL}(p_t\Vert q^{(\beta)})/\beta$. (f) Efficient implementation: one mixing with $p_t$ yields both $\mathbb{E}_{p_t}[v]$ and $\beta_{\max}^{-1}\log \mathbb{E}_{p_t}[e^{\beta_{\max} v}]$, then gating produces $o_t$.
  • Figure 3: Single-layer toy per-channel argmax task. Left: validation MSE over training steps. Right: per-channel index accuracy. FEM rapidly fits the channel-wise argmax, while a softmax attention layer stays near chance level, reflecting the limitation of convex mixing.
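
The pieces described in Figure 2 (d) to (f) can be checked numerically on a single channel. The sketch below uses plain Python and makes several assumptions. The helper names and toy numbers are hypothetical. The gating form `gate * ((1 - lam) * mean + lam * f_max)` is only one plausible reading of "inner gate interpolating and outer gate scaling" and is not confirmed by the text above. The single loop merely mimics the idea of obtaining both statistics from one mixing with $p_t$.

```python
import math


def mean_and_free_energy(prior, column, beta_max):
    """One pass over a channel: returns (E_p[v], (1/beta_max) * log E_p[exp(beta_max * v)])."""
    shift = max(beta_max * x for x in column)  # stabilises the exponentials
    mean = 0.0
    tilted = 0.0
    for p, x in zip(prior, column):
        mean += p * x
        tilted += p * math.exp(beta_max * x - shift)
    return mean, (shift + math.log(tilted)) / beta_max


def kl_prior_to_posterior(prior, column, beta):
    """KL(p || q) for the tilted posterior q(i) proportional to p(i) * exp(beta * v_i)."""
    shift = max(beta * x for x in column)
    weights = [p * math.exp(beta * x - shift) for p, x in zip(prior, column)]
    z = sum(weights)
    posterior = [w / z for w in weights]
    return sum(p * math.log(p / q) for p, q in zip(prior, posterior) if p > 0)


def gated_output(mean, f_max, lam, gate):
    # ASSUMPTION: inner gate lam in [0, 1] interpolates between the mean and the
    # max-temperature branch; outer gate scales the result. Hypothetical form.
    return gate * ((1.0 - lam) * mean + lam * f_max)


# Toy single-channel data (assumed).
prior = [0.5, 0.3, 0.2]
column = [1.0, 0.0, 0.5]
beta_max = 4.0

mu, f_max = mean_and_free_energy(prior, column, beta_max)
gap = f_max - mu  # improvement of the free-energy read over the mean
kl_over_beta = kl_prior_to_posterior(prior, column, beta_max) / beta_max
```

Here `gap` and `kl_over_beta` agree up to floating-point error, matching the caption's statement that the improvement over $\mu_t$ equals $\mathrm{KL}(p_t\Vert q^{(\beta)})/\beta$. The identity follows from expanding $\log(p_t/q^{(\beta)})$ as $\log \mathbb{E}_{p_t}[e^{\beta v}] - \beta v$ and taking the expectation under $p_t$.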

Theorems & Definitions (44)

  • Definition 2.1: Channel-wise selector
  • Lemma 2.2
  • Corollary 2.3
  • Lemma 2.4
  • Lemma 2.5
  • Proposition 2.6: Selection budget
  • Proposition 2.7: Token‑separable mixers are convexly constrained
  • Theorem 2.8: Free-energy selection and budget duality (with $\beta$ as inverse temperature)
  • Proposition 2.9: Complexity-preserving normalization
  • Proposition C.1: Dimension‑dependent linearization and memory collapse for a softmax head
  • ...and 34 more