Competition for attention predicts good-to-bad tipping in AI

Neil F. Johnson; Frank Y. Huo

TL;DR

The paper examines the tipping risk of offline edge LLMs and develops a predictive framework in which competition for attention between good and bad basins drives a tipping point $n^*$ that can be forecast from dot-product geometry. Using penultimate-layer embeddings and domain-specific basins, the authors derive $n^*$ (Eq. 2) and validate it across six decoder-only transformers and production-scale data, reporting robust directional predictions and a near-boundary diagnostic. They also show that tipping can be steered by conversation history and content injections, which enables cost-effective real-time monitoring that does not require cloud tools. The work offers domain-portable safety monitoring and actionable levers to delay or prevent harmful tipping across languages and contexts.

Abstract

More than half the global population now carries devices that can run ChatGPT-like language models with no Internet connection and minimal safety oversight, and hence with the potential to promote self-harm, financial losses, and extremism, among other dangers. Existing safety tools either require cloud connectivity or discover failures only after harm has occurred. Here we show that a large class of potentially dangerous tipping originates at the atomistic scale in such edge AI due to competition for the machinery's attention. This yields a mathematical formula for the dynamical tipping point $n^*$, governed by dot-product competition for attention between the conversation's context and competing output basins, that reveals new control levers. Validated against multiple AI models, the mechanism can be instantiated for different definitions of 'good' and 'bad' and hence in principle applies across domains (e.g. health, law, finance, defense), changing legal landscapes (e.g. EU, UK, US and state level), languages, and cultural settings.

Paper Structure (7 sections, 2 equations, 3 figures, 1 table)

This paper contains 7 sections, 2 equations, 3 figures, 1 table.

Model and generation.
Embedding extraction and basin construction.
Metrics.
Bootstrap and statistical assessment.
Dual-codebase robustness.
CCDH mapping.
Real-time deployment cost.

Figures (3)

Figure 1: Good-to-bad tipping during conversation with AI. (a) Schematic of on-device deployment where the model runs locally with no Internet connection and no safety oversight. (b) Examples of 1-step conversations across topics with an LLM representative of models now running on air-gapped phones and edge devices. Tipping from good (${\bf B}$) to bad (${\bf D}$) content can be immediate or follow a run of good output. (XXXX Bank is HSBC Bank).
Figure 2: Multi-step conversations exhibit history-dependent tipping. Introduction of additional content (e.g. ${\bf C}$) by the user can steer the AI's response to a given question between good (${\bf B}$) and bad (${\bf D}$). The mechanism is the same competition for attention between ${\bf B}$ and ${\bf D}$ as in Fig. 1(b), now operating across a two-way conversation that evolves over time.
Figure 3: (a) Practical implementation of a lite controller. (b) Schematic of competing basins for outputs ${\bf B}$ and ${\bf D}$ given input ${\bf A}$. Using centroids ${\bf A}=(0.4,-0.3)$, ${\bf B}=(0.8,0)$, ${\bf D}=(0.9,0.5)$, panels (c) and (d) show the predicted output for a conversation ${\bf ACCA}$ in which content ${\bf C}$ has been introduced. (c) Aligning ${\bf C}$ toward ${\bf D}$ decreases $n^*$ and hence favors tipping to output ${\bf D}$. Example ${\bf C}=(0.2,0.2)$ is shown; $n^*=1$. (d) Aligning ${\bf C}$ away from ${\bf D}$ increases $n^*$, as predicted by Eq. 2, and hence delays tipping to output ${\bf D}$. Example ${\bf C}=(-0.2,-0.2)$ is shown. Examples shown for other inputs.
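
The direction of the effect described in the Figure 3 caption can be illustrated with a minimal Python sketch. It uses the centroids quoted in the caption, but it is not the paper's Eq. 2, which is not reproduced on this page. As a loudly stated assumption, the conversation context is taken to be the plain sum of its vectors, and the basin with the larger dot product against that context is treated as favored. The helper names `dot`, `basin_gap`, and `favored_basin` are hypothetical.

```python
# Toy illustration of dot-product competition between basins B and D.
# ASSUMPTION: context = sum of the conversation's vectors; this is a
# simplification for illustration, not the paper's Eq. 2 for n*.

# Centroids quoted in the Figure 3 caption.
A = (0.4, -0.3)
B = (0.8, 0.0)   # "good" output basin
D = (0.9, 0.5)   # "bad" output basin

C_TOWARD_D = (0.2, 0.2)      # example from panel (c)
C_AWAY_FROM_D = (-0.2, -0.2)  # example from panel (d)


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def basin_gap(conversation):
    """Return context.D - context.B for the summed conversation context."""
    context = tuple(sum(components) for components in zip(*conversation))
    return dot(context, D) - dot(context, B)


def favored_basin(conversation):
    """Name the basin with the larger dot product against the context."""
    return "D" if basin_gap(conversation) > 0 else "B"


if __name__ == "__main__":
    for label, conv in [
        ("A alone", [A]),
        ("ACCA, C toward D", [A, C_TOWARD_D, C_TOWARD_D, A]),
        ("ACCA, C away from D", [A, C_AWAY_FROM_D, C_AWAY_FROM_D, A]),
    ]:
        print(label, "->", favored_basin(conv))
```

Under this toy assumption, input ${\bf A}$ alone favors ${\bf B}$, injecting ${\bf C}$ aligned toward ${\bf D}$ shifts the balance to ${\bf D}$, and injecting ${\bf C}$ aligned away from ${\bf D}$ keeps ${\bf B}$ favored, which matches the direction stated in panels (c) and (d). It does not compute $n^*$ itself.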
