Random Quadratic Form on a Sphere: Synchronization by Common Noise

Maximilian Engel, Anna Shalova

Abstract

We introduce the Random Quadratic Form (RQF): a stochastic differential equation which formally corresponds to the gradient flow of a random quadratic functional on a sphere. While the one-point motion of the system is a Brownian motion and thus has no preferred direction, the two-point motion exhibits nontrivial synchronizing behaviour. In this work we study synchronization of the RQF: we give both distributional and path-wise characterizations of the solutions by studying the invariant measures and random attractors of the system. The RQF model is motivated by the study of the role of linear layers in transformers and illustrates the phenomenon of synchronization by common noise arising in simplified models of transformers. In particular, we provide an alternative explanation of the clustering behaviour in deep transformers, independent of self-attention, and show that tokens cluster even in the absence of the self-attention mechanism.

Paper Structure (22 sections, 16 theorems, 105 equations, 2 figures)

Key Result

Theorem 1.3

Let $M \in \mathop{\mathrm{Sym}}\nolimits^n$ be a symmetric matrix, sampled from the Gaussian Orthogonal Ensemble (cf. eq:GOE). Then, with probability $1$, there exists $x^*\in {\mathbb S}^{n-1}$ such that the gradient flow of the quadratic form eq:intro-dqf satisfies for a.e. initial condition $x_0 \in {\mathbb S}^{n-1}$. In other words, almost every trajectory of the gradient flow $x(t)$ conver

Figures (2)

  • Figure 1: Ensemble of RQFs driven by the same process $Q_t$ from different initial conditions. Around time $t \sim 5$ the trajectories approach the random attractor, which consists of two antipodal points that continue to move in time.
  • Figure 2: Solutions of the gradient flow of the deterministic quadratic form. On the left: trajectories with different initial conditions (marked by blue dots). On the right: dynamics of the pairwise scalar products between the trajectories.
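
The deterministic setting of Theorem 1.3 and Figure 2 can be explored numerically. The following is a minimal sketch, not the authors' code, and it rests on several assumptions that are not stated in this summary: the flow is taken to be the ascent flow $\dot x = (I - x x^\top) M x$ on ${\mathbb S}^{n-1}$ (the sign convention of eq:intro-dqf is not visible here), the GOE normalisation is one common choice, and the integrator is an explicit Euler step followed by renormalisation. The names `sample_goe` and `gradient_flow` are hypothetical helpers. Under these assumptions one would expect every run to end close to plus or minus the leading eigenvector of $M$, so that the absolute values of the pairwise scalar products approach $1$.

```python
import numpy as np


def sample_goe(n, rng):
    """Symmetric Gaussian matrix; the scaling is an assumed GOE normalisation."""
    a = rng.standard_normal((n, n))
    return (a + a.T) / np.sqrt(2.0 * n)


def gradient_flow(m, x0, dt=0.05, steps=8000):
    """Explicit Euler with renormalisation for dx/dt = (I - x x^T) M x.

    Assumed ascent convention: the flow increases x^T M x on the unit sphere.
    """
    x = x0 / np.linalg.norm(x0)
    for _ in range(steps):
        mx = m @ x
        x = x + dt * (mx - (x @ mx) * x)  # tangential component of M x
        x /= np.linalg.norm(x)            # project back onto the sphere
    return x


rng = np.random.default_rng(0)
n = 5
m = sample_goe(n, rng)

# Several random initial conditions, all driven by the same matrix M.
starts = rng.standard_normal((4, n))
limits = np.array([gradient_flow(m, x0) for x0 in starts])

# Pairwise scalar products between the final states (cf. Figure 2, right).
gram = limits @ limits.T

# Leading eigenvector of M, used here as the candidate for x^*.
eigvals, eigvecs = np.linalg.eigh(m)
top = eigvecs[:, -1]

print(np.round(np.abs(gram), 4))
print(np.round(np.abs(limits @ top), 4))
```

The renormalised Euler step is equivalent to a shifted power iteration, which is why the leading eigenvector is the natural candidate for $x^*$ under the ascent convention. With the opposite sign convention the same sketch would instead select the eigenvector of the smallest eigenvalue.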

Theorems & Definitions (38)

  • Remark 1.1: Back from the random quadratic form to the deterministic one
  • Remark 1.2: Interpretation of the driving functional
  • Theorem 1.3: Deterministic Quadratic Form
  • Theorem 1.4: Random Quadratic Form
  • Theorem 2.1: Deterministic Quadratic Form, mahony1996gradient
  • Remark 2.2: $\dim \Lambda_m = 1$
  • Remark 2.3: Wasserstein gradient flow
  • Remark 2.4: Random Wasserstein gradient flow
  • Definition 3.1: Random dynamical system (RDS)
  • Proposition 3.2: SDE as an RDS
  • ...and 28 more