Krause Synchronization Transformers

Jingkun Liu, Yisong Yue, Max Welling, Yue Song

TL;DR

Krause Attention replaces global dot-product attention with distance-based, bounded-confidence interactions, promoting local, selective token coupling and mitigating attention sinks. Grounded in Krause consensus dynamics and analyzed via interacting-particle and mean-field frameworks, it yields multi-cluster token coordination and linear-time complexity $O(N W d)$ instead of $O(N^2 d)$. Empirically, Krause Transformers show consistent gains across vision, autoregressive generation, and LLMs, with improved robustness and reduced computation. This approach introduces a principled inductive bias for scalable, dynamics-aware Transformer design, preserving diversity while enabling coherent local coordination.
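The complexity claim can be made concrete with a rough operation count. The sketch below assumes that $N$ is the sequence length, $d$ the feature dimension, and $W$ the size of the local neighborhood each token attends to; the summary does not define these symbols, so this reading is an assumption. The helper `attention_cost` and the example sizes are hypothetical, and the function counts only score computations, not the full layer cost.

```python
def attention_cost(n_tokens, dim, window=None):
    """Rough multiply-accumulate count for computing pairwise attention scores.

    window=None models global attention (every token scores every token);
    an integer window models a local neighborhood of that many tokens.
    """
    pairs_per_token = n_tokens if window is None else min(window, n_tokens)
    return n_tokens * pairs_per_token * dim


# Illustrative sizes only; these are not taken from the paper.
full = attention_cost(4096, 64)               # O(N^2 d)
local = attention_cost(4096, 64, window=128)  # O(N W d)
print(full // local)  # 32, i.e. N / W
```

With $W$ held fixed, the local count grows linearly in $N$, and the saving over global attention is the factor $N / W$.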

Abstract

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.

Paper Structure

This paper contains 40 sections, 36 equations, 11 figures, 21 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Krause Attention, grounded in bounded-confidence interactions, promotes localized multi-cluster synchronization (top). In contrast, standard self-attention tends to induce globally coupled dynamics that concentrate attention onto a dominant mode, often manifesting as attention sinks (Xiao et al., 2023) (bottom).
  • Figure 2: Krause Attention computes RBF affinity scores, restricts updates to local neighborhoods, and applies top-$k$ selective interactions.
  • Figure 3: Krause Attention yields more diverse attention heads.
  • Figure 4: Unconditional samples generated by KARM on MNIST.
  • Figure 5: Samples completed by KARMs on CIFAR-10.
  • ...and 6 more figures
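
The three steps named in the Figure 2 caption (RBF affinity scores, local neighborhoods, top-$k$ selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian form of the affinity, the bandwidth `sigma`, the symmetric (non-causal) window, and the normalization over the kept neighbors are all assumptions made for illustration, and `krause_attention` is a hypothetical name.

```python
import math


def krause_attention(queries, keys, values, sigma=1.0, window=2, top_k=2):
    """Illustrative bounded-confidence attention over lists of float vectors."""
    n = len(queries)
    dim = len(values[0])
    outputs = []
    for i in range(n):
        # Step 1 (assumed form): local neighborhood of positions around i.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # Step 2 (assumed form): Gaussian RBF affinity from squared distance.
        scored = []
        for j in range(lo, hi):
            dist2 = sum((q - k) ** 2 for q, k in zip(queries[i], keys[j]))
            scored.append((math.exp(-dist2 / (2.0 * sigma ** 2)), j))
        # Step 3: keep only the top_k strongest affinities.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        kept = scored[:top_k]
        total = sum(w for w, _ in kept)
        if total == 0.0:
            # All affinities underflowed; fall back to a uniform average.
            kept = [(1.0, j) for _, j in kept]
            total = float(len(kept))
        # Normalized weighted average of the selected values.
        outputs.append(
            [sum(w * values[j][c] for w, j in kept) / total for c in range(dim)]
        )
    return outputs


# Two well-separated 1-D groups of tokens (toy data, not from the paper).
points = [[0.0], [0.1], [5.0], [5.1]]
out = krause_attention(points, points, points, sigma=1.0, window=3, top_k=2)
```

In this toy setting each token averages only with its nearest neighbor, so the two groups stay separate instead of being pulled toward a single global mean, which is the multi-cluster behavior the Figure 1 caption describes.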