Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation
Randall Balestriero, Romain Cosentino, Sarath Shekkizhar
TL;DR
This work introduces a geometric lens to understand large language models, deriving the intrinsic dimension of multi-head attention embeddings and the MLP-driven partitioning of the MHA manifold. By showing that MHA outputs live in convex hulls and their Minkowski sums, the authors connect prompt structure to model expressivity and vulnerability to RLHF safeguards. They further recast MLPs as continuous piecewise affine splines, deriving seven unsupervised features per layer that capture prompt geometry and enable low-latency toxicity detection, achieving state-of-the-art results on Omni-Toxic and Jigsaw benchmarks. The study provides both theoretical insights and practical tools, including code, for safer AI and more robust interpretations of LLM behavior under varying prompts and architectures.
Abstract
Large Language Models (LLMs) drive current AI breakthroughs despite very little being known about their internal representations. In this work, we propose to shed the light on LLMs inner mechanisms through the lens of geometry. In particular, we develop in closed form $(i)$ the intrinsic dimension in which the Multi-Head Attention embeddings are constrained to exist and $(ii)$ the partition and per-region affine mappings of the feedforward (MLP) network of LLMs' layers. Our theoretical findings further enable the design of novel principled solutions applicable to state-of-the-art LLMs. First, we show that, through our geometric understanding, we can bypass LLMs' RLHF protection by controlling the embedding's intrinsic dimension through informed prompt manipulation. Second, we derive interpretable geometrical features that can be extracted from any (pre-trained) LLM, providing a rich abstract representation of their inputs. We observe that these features are sufficient to help solve toxicity detection, and even allow the identification of various types of toxicity. Our results demonstrate how, even in large-scale regimes, exact theoretical results can answer practical questions in LLMs. Code: https://github.com/RandallBalestriero/SplineLLM
