Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation

Randall Balestriero; Romain Cosentino; Sarath Shekkizhar

TL;DR

This work introduces a geometric lens for understanding large language models, deriving the intrinsic dimension of multi-head attention (MHA) embeddings and the MLP-driven partitioning of the MHA manifold. By showing that MHA outputs live in convex hulls and their Minkowski sums, the authors connect prompt structure to model expressivity and to the vulnerability of RLHF safeguards. They further recast MLPs as continuous piecewise affine splines, deriving seven unsupervised features per layer that capture prompt geometry and enable low-latency toxicity detection, achieving state-of-the-art results on the Omni-Toxic and Jigsaw benchmarks. The study provides both theoretical insights and practical tools, including code, for safer AI and more robust interpretations of LLM behavior under varying prompts and architectures.

Abstract

Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed light on LLMs' inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
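To make the phrase "partition and per-region affine mappings" concrete, here is a minimal sketch under stated assumptions: a toy one-hidden-layer ReLU MLP with hand-picked weights. The function names, shapes, and weights are illustrative and are not taken from the paper or its repository; LLM MLPs often use smoother activations, for which this exact identity would only be an approximation.

```python
def relu_mlp(x, W1, b1, W2, b2):
    """Toy MLP: W2 @ relu(W1 @ x + b1) + b2, written with plain lists."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]


def region_affine_map(x, W1, b1, W2, b2):
    """Return (q, A, c) with MLP(z) = A z + c for every z that shares
    the ReLU activation pattern q of x (i.e. lies in the same region)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    q = [1.0 if p > 0 else 0.0 for p in pre]          # region code
    hidden, d_in, d_out = len(q), len(x), len(W2)
    A = [[sum(W2[o][k] * q[k] * W1[k][d] for k in range(hidden))
          for d in range(d_in)] for o in range(d_out)]
    c = [sum(W2[o][k] * q[k] * b1[k] for k in range(hidden)) + b2[o]
         for o in range(d_out)]
    return q, A, c


# Hypothetical weights, chosen only for illustration.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.1]

x = [1.0, 0.5]
q, A, c = region_affine_map(x, W1, b1, W2, b2)
affine_out = [sum(a * xi for a, xi in zip(row, x)) + ci for row, ci in zip(A, c)]
mlp_out = relu_mlp(x, W1, b1, W2, b2)
```

Within one region the network is exactly affine, so `affine_out` and `mlp_out` agree; the activation pattern `q` identifies which region of the partition the input falls in.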

Paper Structure (25 sections, 3 theorems, 9 equations, 14 figures, 10 tables)

Introduction
Related Work
Multi-Head Attention: Minkowski Sum of Convex Hulls
Multi-Head Attention
The Role of MHA Intrinsic Dimension in Toxic Generation
Setting the ID threshold:
ID-based Jailbreak:
MLPs Linear Regions Characterize Your Prompt
The Affine Spline Hiding in Plain Sight
Spline Features To Characterize LLM Prompts
Application: Low-Latency Toxicity Detection
Application: Jigsaw Challenge
Discussion
Conclusions
Supplementary Materials
...and 10 more sections

Key Result

Lemma 3.1

The $i^{\rm th}$ row of the $h^{\rm th}$ head mapping output ${\rm Head}_h^{(\ell)}({\bm{X}})$ lies within the convex hull ${\rm Hull}\left\{({\bm{V}}_h^{(\ell)})^\top {\bm{x}}_j, j=1,\dots,i\right\}$ and is of effective dimension at most $\#\left\{{\rm Attn}_h^{(\ell)}({\bm{X}}^{(\ell)})_{i,j}> 0, j=1,\dots,i\right\}$.
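The convex-hull statement of Lemma 3.1 can be checked numerically on a toy example. The sketch below assumes a single causal softmax head with random stand-ins for the attention logits and for the projected tokens $({\bm{V}}_h^{(\ell)})^\top {\bm{x}}_j$; all names and sizes are hypothetical, not the authors' code.

```python
import math
import random


def causal_attention_row(scores_row, i):
    """Softmax over positions 0..i (causal mask); exact zeros afterwards."""
    visible = scores_row[: i + 1]
    m = max(visible)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in visible]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores_row) - i - 1)


def head_output_row(weights, values):
    """Row i of the head output: sum_j weights[j] * values[j]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


random.seed(0)
T, D = 4, 2                               # toy sequence length and head width
values = [[random.gauss(0, 1) for _ in range(D)] for _ in range(T)]
scores = [[random.gauss(0, 1) for _ in range(T)] for _ in range(T)]

i = 2                                     # 0-indexed token position
weights = causal_attention_row(scores[i], i)
row = head_output_row(weights, values)

# Number of strictly positive attention entries in row i; per Lemma 3.1
# this count upper-bounds the effective dimension of the output row.
num_positive = sum(1 for w in weights if w > 0)
```

Because the weights are non-negative and sum to one, `row` is a convex combination of the visible `values`, which is exactly membership in their convex hull. With an exact softmax every visible entry is positive, so the count equals the number of visible tokens.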

Figures (14)

Figure 1: Illustration of LLM geometry at a single transformer layer for a $3$-token input sequence, $\left\{{\bm{x}}_1, {\bm{x}}_2, {\bm{x}}_3\right\}$. Left: We represent the convex hulls induced by $2$ heads projected onto the output layer, as described in Eq. \ref{eq:hull_O}. In each head, the embedding of the $3^{\rm rd}$ token, i.e., the one corresponding to the forecasted token, is constrained to belong to the associated hull (one triangle per head). Middle: The combination of the heads, Eq. \ref{eq:multihead}, induces the Minkowski sum of the single-head convex hulls described in Theorem 3.2, which here defines the depicted hexagon. This is the space to which the embedding of the $3^{\rm rd}$ token belongs. For our depiction, we set $({\bm{V}}_2^{(\ell)} {\bm{O}}_2^{(\ell)})^{\top}{\bm{x}}_1$ as the origin; consequently, $({\bm{V}}_1^{(\ell)} {\bm{O}}_1^{(\ell)})^{\top}{\bm{x}}_2$ is at the center (interior) of the hexagon. The Minkowski sum is then obtained by translating the lower triangle (green) along the boundaries of the upper triangle (blue). Right: The output of the MHA is mapped onto the unit circle (bias-less layer norm), which is then partitioned by the continuous affine mapping induced by the MLP. Each region (different colors) represents an affine mapping as in Eq. \ref{eq:CPA}. Our analysis indicates that a model's expressiveness can be enhanced either by incorporating more attention heads/partitions or by increasing the number of pertinent tokens within the input sequence. This insight provides a potential rationale for the effectiveness of larger language models and the emergence of in-context learning.
Figure 2: Estimation of the intrinsic dimension threshold, $\epsilon$ in Eq. \ref{eq:id_est}. The plot presents the distribution of the self-attention values (normalized by the max attention value) across all the layers, attention heads, and samples used in our experiments (asian, muslim, violence, bomb making). Our cut-off value, i.e., $0.1 \times a_{\max}$, corresponds to the elbow of this distribution.
Figure 3: Visualization of the intrinsic dimension (last layer) of different manipulated prompts: for each subplot, all the samples share the same final sentence, a toxic sample from the toxigen dataset (from top left to bottom right, the toxic topics are racism / asian, religion / islam, violence / hate, bomb making / Molotov cocktail). Each curve corresponds to prepending the toxic sentence with either an unrelated prompt, a related prompt, or a random-token prompt, each at different context lengths. We observe that the output generation of the LLM is toxic only when the intrinsic dimension (ID) is increased by the prepended prompts. We also observe that, depending on the topic, the prepended random-token prompt affects the ID differently, and therefore does not necessarily lead to toxic generation. A more detailed version of the first subplot is given in Figure \ref{fig:generation}.
Figure 4: Visualization of the intrinsic dimension (last layer) of different manipulated prompts: all share the same final sentence, a toxic sample from the toxigen dataset. For the blue line, we prepend unrelated sentences and see that the intrinsic dimension remains constant and the generation remains safe. For the red line, we prepend related non-toxic sentences and observe that doing so increases the embedding's intrinsic dimension, as per Lemma 3.1 and Theorem 3.2. In the latter case, it is now more likely that we will visit a part of the space that was missed by RLHF, inducing toxic generation. This implies that the number of prompts that RLHF would need to cover to prevent toxic generation grows exponentially with the intrinsic dimension, per the curse of dimensionality. Additional results are provided in \ref{fig:muslim}, \ref{tab:generation1}, and \ref{tab:generation2}.
Figure 5: Percentage of RLHF bypass success, i.e., toxic generation, when prepending random tokens, as a function of the relative ID change. As base prompts we consider examples from the Toxigen dataset ($280$ samples with an average ID of $140 \pm 27$), along with randomly prepended tokens of length $10$ (iteratively $5\times$ for each base example). For each input, we collect $(i)$ the change in intrinsic dimension of the input with respect to the base prompt, and $(ii)$ the toxicity of the output generated by the LLM. We evaluate the toxicity of the generated output by prompting Mixtral $8\times7$B Instruct. As evidenced in our earlier results, the higher the ID change, the higher the probability of bypassing the RLHF safeguard.
...and 9 more figures
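The thresholding described in the Figure 2 caption can be sketched as follows. This is an illustration under assumptions, not the paper's estimator: the referenced equation is not reproduced on this page, so the sketch assumes the estimate counts, per head, the attention entries exceeding $\epsilon = 0.1 \times a_{\max}$ and sums the counts over heads. The function name `estimate_id`, the toy attention rows, and the choice of taking $a_{\max}$ over the supplied rows are all hypothetical.

```python
def estimate_id(attn_rows, rel_threshold=0.1, a_max=None):
    """Count attention entries above rel_threshold * a_max, summed over heads.

    attn_rows: one attention row (list of weights) per head, all for the
               same query token.
    a_max:     reference maximum attention value; if omitted, the largest
               entry in attn_rows is used (an assumption of this sketch).
    """
    if a_max is None:
        a_max = max(max(row) for row in attn_rows)
    eps = rel_threshold * a_max
    return sum(sum(1 for a in row if a > eps) for row in attn_rows)


# Two toy heads attending over four tokens.
heads = [
    [0.70, 0.25, 0.05, 0.00],   # concentrated head: 2 entries above the cut-off
    [0.40, 0.30, 0.20, 0.10],   # diffuse head: 4 entries above the cut-off
]
id_estimate = estimate_id(heads)
```

In this toy case the diffuse head contributes more to the estimate than the concentrated one, which matches the intuition in the captions that prompts spreading attention over more tokens raise the ID.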

Theorems & Definitions (4)

Lemma 3.1: causal single-head convex hull
Theorem 3.2: causal multi-head Minkowski sum
Proposition 4.1
proof
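The hexagon of Figure 1, which arises from the Minkowski sum in Theorem 3.2, can be reproduced on a toy example. The sketch below assumes two hand-picked 2-D triangles standing in for the single-head convex hulls; it forms all pairwise vertex sums and takes their convex hull with a standard monotone-chain routine. The coordinates and function names are illustrative only.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def minkowski_sum(hull_a, hull_b):
    """Minkowski sum of two convex polygons given by their vertices."""
    sums = [(a[0] + b[0], a[1] + b[1]) for a in hull_a for b in hull_b]
    return convex_hull(sums)


# Hypothetical single-head hulls (one triangle per head).
head_1 = [(0, 0), (2, 0), (0, 2)]
head_2 = [(0, 0), (1, 1), (-1, 1)]

multi_head_region = minkowski_sum(head_1, head_2)
```

For these two triangles no pair of edges is parallel, so the sum has six vertices. Each head adds its own edge directions to the region, which is one way to see why adding heads enlarges the set of reachable embeddings.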
