Selective Synchronization Attention

Hasi Hays

TL;DR

This work proposes Selective Synchronization Attention (SSA), a novel attention mechanism that replaces standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators.

Abstract

The Transformer architecture has become the foundation of modern deep learning, yet its core self-attention mechanism suffers from quadratic computational complexity and lacks grounding in biological neural computation. We propose Selective Synchronization Attention (SSA), a novel attention mechanism that replaces the standard dot-product self-attention with a closed-form operator derived from the steady-state solution of the Kuramoto model of coupled oscillators. In SSA, each token is represented as an oscillator characterized by a learnable natural frequency and phase; the synchronization strength between token pairs, determined by a frequency-dependent coupling and phase-locking condition, serves as the attention weight. This formulation provides three key advantages: (i) natural sparsity arising from the phase-locking threshold, whereby tokens with incompatible frequencies automatically receive zero attention weight without explicit masking; (ii) unified positional-semantic encoding through the natural frequency spectrum, eliminating the need for separate positional encodings; and (iii) a single-pass, closed-form computation that avoids iterative ODE integration, with all components (coupling, order parameter, synchronization) derived from the oscillatory framework. We instantiate SSA within the Oscillatory Synchronization Network (OSN), a drop-in replacement for the Transformer block. Analysis of the synchronization matrices reveals non-uniform, head-diverse coupling patterns even at initialization, demonstrating a stronger architectural inductive bias than the approximately uniform attention produced by randomly initialized Transformers.
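
To make the idea of a phase-locking threshold concrete, here is a minimal sketch in plain Python. It is not the paper's operator: the function names, the scalar treatment of $K$, $r$ and $J$, the square-root form of the locked-pair strength, and the row normalization are all assumptions chosen for illustration. The only parts taken from the text are that pairs whose frequency gap exceeds $KrJ$ receive exactly zero weight and that self-synchronization equals 1.

```python
import math

def synchronization_matrix(omega, K=1.0, r=1.0, J=1.0):
    """Toy pairwise synchronization strengths from natural frequencies.

    Assumptions (illustrative, not the paper's exact formula):
    - K, r and J are scalars, so the locking threshold is K * r * J;
    - a locked pair's strength is the cosine of its steady-state phase
      offset, sqrt(1 - (d_omega / threshold)^2);
    - pairs with |d_omega| > threshold get exactly zero weight.
    """
    threshold = K * r * J
    n = len(omega)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ratio = (omega[i] - omega[j]) / threshold
            if abs(ratio) <= 1.0:
                S[i][j] = math.sqrt(1.0 - ratio * ratio)
    return S

def attend(S, values):
    """Mix value vectors with row-normalized synchronization weights.

    Row normalization is an assumption; each row sum is at least 1
    because S[i][i] == 1, so the division is always safe.
    """
    dim = len(values[0])
    out = []
    for row in S:
        total = sum(row)
        out.append([sum(w * v[d] for w, v in zip(row, values)) / total
                    for d in range(dim)])
    return out

# Two frequency clusters: tokens 0-1 near 1.0, tokens 2-3 near 3.0.
omega = [1.0, 1.1, 3.0, 3.05]
S = synchronization_matrix(omega, K=1.0, r=0.5, J=1.0)
mixed = attend(S, [[1.0], [1.0], [0.0], [0.0]])
```

With these toy frequencies the matrix is block-diagonal: tokens mix only within their own cluster, and no explicit mask is applied anywhere.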

Paper Structure (41 sections, 4 theorems, 22 equations, 5 figures, 1 table)

Introduction
Related Work
Efficient Attention Mechanisms
Attention Alternatives
Oscillatory Models in Machine Learning
Neuroscience of Oscillatory Attention
Method
Preliminaries: The Kuramoto Model
Selective Synchronization Attention
Phase and frequency initialization.
Frequency-dependent coupling.
Closed-form phase-alignment operator.
Output computation.
Multi-Frequency Synchronization Heads
The OSN Block
...and 26 more sections

Key Result

Proposition 1

The computational complexity of SSA is:

Figures (5)

Figure 1: Selective Synchronization Attention: conceptual illustration for the sentence "The cat sat on the mat." (a) Phase space: tokens are represented as oscillators on the phase circle. Tokens with similar natural frequencies ($\omega_1$ or $\omega_2$) synchronize into coherent clusters (red and blue), enabling information exchange via solid connections. Adjacent tokens from different clusters ("sat" and "on") remain desynchronized ($S_{ij} = 0$, dashed line), producing natural sparsity. (b) Frequency space: the two clusters occupy distinct regions, separated by the phase-locking threshold $|\Delta\omega| > KrJ$. (c) Synchronization matrix $\mathbf{S}$: the resulting block-diagonal structure, with strong within-cluster synchronization and zero cross-cluster weights.
Figure 2: Architecture comparison. Left: Standard Transformer block with multi-head dot-product attention and explicit positional encoding. Right: OSN block with Multi-Frequency Synchronization Heads using the closed-form phase-alignment operator. The blocks share identical input-output interfaces ($\mathbb{R}^{N \times D} \to \mathbb{R}^{N \times D}$), enabling drop-in replacement.
Figure 3: Single-block GPU benchmark (NVIDIA A100, Google Colab). Left: Throughput (K tokens/sec) vs. sequence length. Center: Forward pass latency (ms). Right: Peak GPU memory (MB). Batch sizes adapted per sequence length ($B = 8$ for $N \leq 512$; $B = 4, 2, 1$ for $N = 1024, 2048, 4096$). Configuration: $D = 512$, $H = 8$, $k = 64$ for sparse variant. Each measurement averaged over 50 trials with 10 warmup iterations.
Figure 4: Synchronization matrices $\mathbf{S}^{(h)}$ from eight Multi-Frequency Synchronization Heads for a randomly initialized OSN block ($D = 512$, $H = 8$, NVIDIA A100). Each panel shows the first $128 \times 128$ tokens of a 512-length sequence. The diagonal reflects self-synchronization ($S_{ii} = 1$), while off-diagonal values encode the frequency-dependent coupling and phase-locking condition. Different heads exhibit distinct synchronization profiles, demonstrating multi-frequency head diversity.
Figure 5: Distribution of the empirical order parameter $r$ across eight synchronization heads, computed from 100 random input samples ($N = 256$, $D = 512$, NVIDIA A100). The order parameter concentrates around $r \approx 0.847$ with low variance ($\sigma \approx 0.002$), demonstrating stable emergent phase coherence consistent with the Kuramoto formulation.
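
For readers unfamiliar with the order parameter $r$ reported in Figure 5, the standard Kuramoto definition is $r e^{i\psi} = \frac{1}{N} \sum_{j} e^{i\theta_j}$, where $\theta_j$ are the oscillator phases. The sketch below computes it for a list of phases. The function name is hypothetical, and how the paper obtains phases from token embeddings is not shown on this page, so the inputs here are plain numbers.

```python
import cmath
import math

def order_parameter(phases):
    """Return (r, psi) such that r * exp(i * psi) is the mean of exp(i * theta_j).

    r is close to 1 when the phases are nearly aligned and close to 0
    when they are spread evenly around the circle.
    """
    z = sum(cmath.exp(1j * theta) for theta in phases) / len(phases)
    return abs(z), cmath.phase(z)

# Fully aligned phases give r close to 1.
r_aligned, psi_aligned = order_parameter([0.3, 0.3, 0.3])

# Phases spaced evenly around the circle give r close to 0.
r_spread, _ = order_parameter([0.0, 2 * math.pi / 3, 4 * math.pi / 3])
```

A value of $r$ between these extremes, such as the one reported in the caption, corresponds to partial phase coherence.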

Theorems & Definitions (7)

Proposition 1: Complexity of SSA
Theorem 1: Universality
Theorem 2: Natural Sparsity
Proposition 2: Positional Encoding Bias
proof
proof
proof
