Capturing AI's Attention: Physics of Repetition, Hallucination, Bias and Beyond

Frank Yingjie Huo; Neil F. Johnson

TL;DR

We derive a first-principles physics theory of the basic Attention head, the AI engine at the heart of LLMs' 'magic' (e.g. ChatGPT, Claude). The theory allows a quantitative analysis of outstanding AI challenges such as output repetition, hallucination and harmful content, and bias.

Abstract

We derive a first-principles physics theory of the AI engine at the heart of LLMs' 'magic' (e.g. ChatGPT, Claude): the basic Attention head. The theory allows a quantitative analysis of outstanding AI challenges such as output repetition, hallucination and harmful content, and bias (e.g. from training and fine-tuning). Its predictions are consistent with large-scale LLM outputs. Its 2-body form suggests why LLMs work so well, but hints that a generalized 3-body Attention would make such AI work even better. Its similarity to a spin-bath means that existing Physics expertise could immediately be harnessed to help Society ensure AI is trustworthy and resilient to manipulation.

Paper Structure

This paper contains 6 equations, 4 figures.

Figures (4)

Figure 1: (a) Attention, shown here in its most basic form, is used across all generative AI because it works (e.g. LLMs such as ChatGPT). However, there is no first-principles theory for why it works and when it won't. See End Matter for explanations of its terminology, which is unusual for physics. (b) The 'physics' of this Attention process that emerges exactly from our first-principles derivation. Each spin ${\vb*{S}}_i$ is exactly equivalent to a token in an embedding space whose structure reflects the prior training that the AI (LLM etc.) received. Wiggly lines are the effective 2-body interactions that emerge from Eq. \ref{eq:1}. (c) The Context Vector $\vb*{N}^{(0)}$ is exactly equivalent to a bath-projected form of the 2-spin Hamiltonian (Eq. \ref{eq:1}), which is then weighted toward the sub-region of the bath featuring the input spins. The theory predicts how a bias (e.g. from pre-training or fine-tuning the LLM) can perturb $\vb*{N}^{(0)}$ so that the trained LLM's output is dominated by inappropriate rather than appropriate content (e.g. 'bad' such as "THEY ARE EVIL" vs. 'good'). Figures \ref{fig:3} and \ref{fig:4} show this phase boundary in detail.
Figure 2: Next-word prediction for basic Attention (Fig. \ref{fig:1}(a)). Upper panel: first iteration. Lower panel: sixth iteration. For simplicity, we use a 4-word vocabulary (e.g. $\mathtt{A,B,C,D}$) embedded in $\mathbb{R}^3$ as $\vb*{A} = (0.1,0.2,0.3),\ \vb*{B}=(0.4,0.1,0.6),\ \vb*{C}=(0.7,0.6,0.5),\ \vb*{D}=(1.0,1.1,0.3)$. Initial prompt is $\mathtt{ACB}$, and we take all coefficient matrices $\mathsf{W}_Q, \mathsf{W}_K, \mathsf{W}_V = \mathbb{I}$ without affecting the core functionality of Attention. The 4 vectors are plotted together with a specifically normalized $\vb*{N}^{(0)}$, on a 2-dimensional projected plane spanned by $\vb*{N}^{(0)}$ and $\vb*{A} = (0.1,0.2,0.3)$. For both iteration stages, $\vb*{D}$ (i.e. token $\mathtt{D}$) acts like an attractor: it has the largest projection on $\vb*{N}^{(0)}$ (blue dashed lines). As the iterations increase, $\vb*{D}$'s attractor status is reinforced, as can be seen from the increasing alignment between $\vb*{D}$ and $\vb*{N}^{(0)}$.
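The toy setup in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that $\vb*{N}^{(0)}$ is the standard scaled dot-product Attention output for the last token of the sequence (softmax weights with a $1/\sqrt{d}$ factor, all weight matrices set to the identity), and that the next token is chosen greedily as the vocabulary vector with the largest projection on $\vb*{N}^{(0)}$. The function names (`context_vector`, `next_token`, `generate`) are hypothetical.

```python
import math

# Embeddings quoted in the Figure 2 caption (4-word vocabulary in R^3).
EMB = {
    "A": (0.1, 0.2, 0.3),
    "B": (0.4, 0.1, 0.6),
    "C": (0.7, 0.6, 0.5),
    "D": (1.0, 1.1, 0.3),
}


def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_vector(tokens):
    """Attention output for the last token, with W_Q = W_K = W_V = identity.

    Assumption: softmax over dot products scaled by 1/sqrt(d).
    """
    vecs = [EMB[t] for t in tokens]
    query = vecs[-1]
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return tuple(
        sum(w * v[c] for w, v in zip(weights, vecs)) / total
        for c in range(len(query))
    )


def next_token(tokens):
    """Greedy choice: the token whose embedding projects most onto N^(0)."""
    n0 = context_vector(tokens)
    return max(EMB, key=lambda t: dot(EMB[t], n0))


def generate(prompt, steps):
    """Append `steps` greedily predicted tokens to the prompt."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return "".join(tokens)


print(generate("ACB", 6))  # prints "ACBDDDDDD"
```

Under these assumptions the context vector is always a positively weighted average of the token embeddings, and $\vb*{D}$ has the largest dot product with each of $\vb*{A}, \vb*{B}, \vb*{C}, \vb*{D}$, so greedy decoding returns $\mathtt{D}$ at every iteration. This matches the attractor behavior described in the caption.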
Figure 3: Phase diagram for the example of a 3-dimensional token embedding given a 4-word vocabulary (THEY, ARE, GOOD, EVIL). Three embeddings are held fixed, $\mathtt{THEY}=(0.25,0.25,0.1),\ \mathtt{ARE}=(0.1,0.3,0.2),\ \mathtt{GOOD}=(0.4,0.3,0.1)$, while the position of EVIL is varied across the diagram. Again for simplicity, $\mathsf{W}_Q, \mathsf{W}_K, \mathsf{W}_V = \mathbb{I}$. The output's content remains 'good' (GOOD) as long as the 'bad' (EVIL) token stays in the blue regime on the left. But if EVIL appears in the red regime, the output's content suddenly flips to 'bad' (EVIL).
Figure 4: (a) Phase boundaries (Fig. \ref{fig:3}) with increasing linear biases $\xi = 0, 0.025, 0.05$ (see End Matter for $\boldsymbol{\delta}$). The change of phase boundary can induce a dramatic change in the output content, since the red token now becomes a highly likely (and repeated) output, while the blue becomes highly unlikely. (b) Phase boundaries with positional encoding $(P_i)_{2m+1} = \sin({i}/{1000^{2m/d}}),\ (P_i)_{2m+2} = \cos({i}/{1000^{2m/d}})$, weight $y=0.1$, for the first 100 iterations of token generation. $\texttt{EVIL}=(0.4,0.15,0.4)$. Phase boundaries generally rotate counterclockwise about the attractor (GOOD) with increasing iterations, until they cross token EVIL, which then becomes the new attractor. Subsequent rotations center around token EVIL. Generated tokens are hence GOOD before the attractor change, and EVIL after. In both panels, token embeddings are the same as in Fig. \ref{fig:3}; $x=0.4$ for simplicity.
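The positional encoding quoted in the Figure 4(b) caption can be written out as a short Python sketch. It implements only the formula as stated, with base 1000 and 1-indexed components $(P_i)_{2m+1}$ and $(P_i)_{2m+2}$, truncated at dimension $d$. How the weight $y$ combines $P_i$ with the token embedding is not specified in the caption, so that step is left out. The function name `positional_encoding` and its `base` parameter are hypothetical.

```python
import math


def positional_encoding(i, d, base=1000.0):
    """Return P_i as a list of d components.

    List index k (0-based) holds the caption's component k + 1:
    even k = 2m   -> (P_i)_{2m+1} = sin(i / base**(2m/d))
    odd  k = 2m+1 -> (P_i)_{2m+2} = cos(i / base**(2m/d))
    """
    p = []
    for k in range(d):
        m = k // 2  # index of the sin/cos pair this component belongs to
        angle = i / base ** (2 * m / d)
        p.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return p


# Encodings for the first few positions in the d = 3 embedding of the caption.
for i in range(3):
    print(i, positional_encoding(i, 3))
```

For $d = 3$ the third component uses $m = 1$, so it varies as $\sin(i/1000^{2/3})$, far more slowly than the first two components over the 100 iterations shown.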