On Surjectivity of Neural Networks: Can you elicit any behavior from your model?
Haozhe Jiang, Nika Haghtalab
TL;DR
The paper asks whether modern neural networks are surjective, that is, whether every possible output is produced by some input. It adopts an almost-everywhere notion of surjectivity and develops a differential-topology framework to analyze common building blocks. The authors prove that Pre-LayerNorm-wrapped blocks, MLPs with LeakyReLU activations, and linear attention are almost always surjective, whereas attention on its own is not. As corollaries, GPT-style transformers and diffusion models with deterministic ODE solvers admit, in principle, inverse mappings for arbitrary outputs. The authors discuss the safety implications, including jailbreak vulnerabilities and the difficulty of fully safeguarding high-capability models, and outline theoretical and practical considerations for safety interventions across language, vision, and robotics domains.
Abstract
Given a trained neural network, can any specified output be generated by some input? Equivalently, does the network correspond to a function that is surjective? In generative models, surjectivity implies that any output, including harmful or undesirable content, can in principle be generated by the networks, raising concerns about model safety and jailbreak vulnerabilities. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and diffusion models with deterministic ODE solvers, admit inverse mappings for arbitrary outputs. By studying surjectivity of these modern and commonly used neural architectures, we contribute a formalism that sheds light on their unavoidable vulnerability to a broad class of adversarial attacks.
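To make the notion of surjectivity concrete, the following is a minimal one-dimensional sketch. It is not the paper's construction or proof. It only contrasts a scalar ReLU, which can never output a negative number, with a scalar LeakyReLU, for which every real target has a preimage. The function names and the slope value 0.01 are assumptions chosen for illustration.

```python
import math


def relu(x: float) -> float:
    """Scalar ReLU: its range is [0, inf), so it is not surjective onto the reals."""
    return max(0.0, x)


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Scalar LeakyReLU with a positive slope on the negative side (slope is an assumed value)."""
    return x if x >= 0 else slope * x


def leaky_relu_preimage(y: float, slope: float = 0.01) -> float:
    """Return an input x such that leaky_relu(x, slope) is y; requires slope > 0."""
    if slope <= 0:
        raise ValueError("slope must be positive for a preimage to exist for every y")
    return y if y >= 0 else y / slope


# Any real target, including a negative one, is reachable through LeakyReLU.
target = -3.0
x = leaky_relu_preimage(target)
assert math.isclose(leaky_relu(x), target)

# ReLU clamps every negative input to zero, so no input can produce a negative target.
assert all(relu(v) >= 0.0 for v in (-10.0, -1.0, 0.0, 1.0, 10.0))
```

In this toy setting, the question "can any specified output be generated by some input?" reduces to whether `leaky_relu_preimage` exists for every target. The paper studies the analogous question for full high-dimensional architectures, where no such closed-form preimage is assumed here.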
