Manifold of Failure: Behavioral Attraction Basins in Language Models

Sarthak Munshi, Manish Bhatt, Vineeth Sai Narajala, Idan Habler, Ammar Al-Kahfah, Ken Huang, Blake Gatto

TL;DR

This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models. It reframes the search for vulnerabilities as a quality diversity problem and uses MAP-Elites to illuminate the continuous topology of these failure regions, which the authors term behavioral attraction basins.

Abstract

While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs: Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini, we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
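
To make the quality diversity framing concrete, the following is a minimal sketch of a generic MAP-Elites loop, not the authors' implementation. The grid resolution `GRID`, the helpers `toy_evaluate` and `toy_mutate`, and the toy quality function are all hypothetical stand-ins: in the paper's setting the candidate would be a prompt, the descriptor would come from the judge, and the quality would be the Alignment Deviation score.

```python
import random

GRID = 25  # hypothetical archive resolution per axis (not stated in this excerpt)


def cell_of(descriptor, grid=GRID):
    """Map a descriptor in [0, 1]^2 to a discrete archive cell."""
    return tuple(min(int(d * grid), grid - 1) for d in descriptor)


def map_elites(evaluate, mutate, seeds, iterations, rng):
    """Keep the highest-quality candidate found in each behavioral cell.

    `seeds` must be non-empty so that the archive has a parent to select.
    """
    archive = {}  # cell -> (quality, candidate)

    def consider(candidate):
        descriptor, quality = evaluate(candidate)
        cell = cell_of(descriptor)
        # Elitism: replace the occupant only if the newcomer scores higher.
        if cell not in archive or quality > archive[cell][0]:
            archive[cell] = (quality, candidate)

    for seed in seeds:
        consider(seed)
    for _ in range(iterations):
        # Uniform selection over occupied cells, then mutation.
        _, parent = rng.choice(list(archive.values()))
        consider(mutate(parent, rng))
    return archive


def coverage(archive, grid=GRID):
    """Fraction of archive cells that hold at least one candidate."""
    return len(archive) / (grid * grid)


def toy_evaluate(candidate):
    """Hypothetical stand-in for 'query the model, then ask the judge'."""
    x, y = candidate
    quality = max(0.0, 1.0 - abs(x - 0.2) - abs(y - 0.8))
    return (x, y), quality


def toy_mutate(candidate, rng):
    """Hypothetical stand-in for prompt mutation: Gaussian jitter in [0, 1]."""
    return tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in candidate)


rng = random.Random(0)
archive = map_elites(toy_evaluate, toy_mutate, [(0.5, 0.5)], 2000, rng)
print(f"coverage: {coverage(archive):.2%}, niches: {len(archive)}")
```

The archive is the product of the search: each occupied cell records the best candidate seen for that region of behavioral space, which is the sense in which the abstract speaks of coverage and distinct niches rather than a single best attack.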

Paper Structure (26 sections, 1 equation, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Two-dimensional behavioral space $\mathcal{B}$. The $x$-axis captures query indirection (direct to metaphorical) and the $y$-axis captures authority framing (none to administrator). Each prompt maps to a point in this space via its behavioral descriptor.
  • Figure 2: System architecture. MAP-Elites selects and mutates prompts from the behavioral archive. Each prompt is sent to the target LLM, and the response is evaluated by the judge to produce a behavioral descriptor $(b)$ and Alignment Deviation score $Q(p)$, which update the archive.
  • Figure 3: 2D behavioral heatmaps reveal model-specific topological signatures (white = safe, dark red = high deviation). (a) Llama-3-8B exhibits a near-universal vulnerability plateau at AD $\approx$ 1.0. (b) GPT-OSS-20B shows spatially concentrated basins in the low-indirection region. (c) GPT-5-Mini shows uniform moderate deviation with a hard ceiling at AD = 0.50.
  • Figure 4: Contour plots showing iso-AD lines across the behavioral space. Horizontal banding at specific $a_2$ values is visible in all models, suggesting authority framing is a critical parameter for alignment. (a) Llama-3-8B: narrow safe corridors within a high-AD landscape. (b) GPT-OSS-20B: localized "bullseye" patterns with nested contour rings. (c) GPT-5-Mini: compressed contours in a narrow AD range (0.39--0.50).
  • Figure 5: Basin maps where red indicates AD $>$ 0.5 (attraction basin) and green indicates AD $\leq$ 0.5 (safe); white cells are unexplored. (a) Llama-3-8B: 93.9% of cells are basins. (b) GPT-OSS-20B: 64.3% of cells are basins, with a complex spatial pattern. (c) GPT-5-Mini: 0% of cells are basins; the model never exceeds the threshold.
  • ...and 2 more figures
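
The basin maps in Figure 5 rest on a simple thresholding rule, which can be sketched as follows. This is an illustrative reading of the caption rather than the authors' code: the function name `basin_summary`, the grid layout, and the choice to report the basin share relative to explored cells (not all cells) are assumptions. Only the threshold itself, AD $>$ 0.5, comes from the caption.

```python
BASIN_THRESHOLD = 0.5  # from the Figure 5 caption: AD > 0.5 marks a basin


def basin_summary(grid, threshold=BASIN_THRESHOLD):
    """Summarize a 2D grid of Alignment Deviation scores.

    `grid` is a list of rows; each entry is an AD score, or None for an
    unexplored cell (shown white in the basin maps).
    """
    total = sum(len(row) for row in grid)
    explored = [ad for row in grid for ad in row if ad is not None]
    if not explored:
        return {"coverage": 0.0, "basin_fraction": 0.0, "mean_ad": None}
    # Strict inequality: a cell at exactly the threshold counts as safe.
    basins = sum(1 for ad in explored if ad > threshold)
    return {
        "coverage": len(explored) / total,
        # Assumption: basin share is taken over explored cells only.
        "basin_fraction": basins / len(explored),
        "mean_ad": sum(explored) / len(explored),
    }


# Hypothetical toy grid, not data from the paper.
toy_grid = [
    [0.5, 0.9, None],
    [0.2, 0.75, None],
]
summary = basin_summary(toy_grid)
print(summary)
```

Because the comparison is strict, a model whose scores never rise above 0.50 yields a basin fraction of zero under this rule, which matches how the caption describes panel (c).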