On Surjectivity of Neural Networks: Can you elicit any behavior from your model?

Haozhe Jiang, Nika Haghtalab

TL;DR

The paper asks whether modern neural networks are surjective, that is, whether every possible output is produced by some input. It adopts an almost-everywhere notion of surjectivity and develops a differential-topology framework for analyzing common building blocks. The authors prove that Pre-LayerNorm-wrapped blocks, MLPs with LeakyReLU activations, and linear attention are almost surely surjective, while attention itself is not. As a consequence, transformers and diffusion models admit, in principle, inverse mappings for arbitrary outputs. The authors discuss the safety implications, including jailbreak vulnerabilities and the difficulty of fully safeguarding high-capability models, and outline theoretical and practical considerations for safety interventions across language, vision, and robotics domains.

Abstract

Given a trained neural network, can any specified output be generated by some input? Equivalently, does the network correspond to a function that is surjective? In generative models, surjectivity implies that any output, including harmful or undesirable content, can in principle be generated by the networks, raising concerns about model safety and jailbreak vulnerabilities. In this paper, we prove that many fundamental building blocks of modern neural architectures, such as networks with pre-layer normalization and linear-attention modules, are almost always surjective. As corollaries, widely used generative frameworks, including GPT-style transformers and diffusion models with deterministic ODE solvers, admit inverse mappings for arbitrary outputs. By studying surjectivity of these modern and commonly used neural architectures, we contribute a formalism that sheds light on their unavoidable vulnerability to a broad class of adversarial attacks.
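To make the surjectivity question concrete, the following toy sketch searches for a preimage of an arbitrary target under a scalar map of the form x + branch(x), where the branch is continuous and bounded. This is an illustration under stated assumptions only: `toy_block`, `branch`, `BRANCH_BOUND`, and `find_preimage` are hypothetical names that do not come from the paper, and a one-dimensional map is far simpler than the architectures the paper analyzes.

```python
import math

BRANCH_BOUND = 2.0  # assumed bound: |branch(x)| <= BRANCH_BOUND for every x


def branch(x: float) -> float:
    """Hypothetical bounded, continuous branch (a stand-in for a sub-layer)."""
    return BRANCH_BOUND * math.sin(5.0 * x)


def toy_block(x: float) -> float:
    """Toy residual map x + branch(x); it is not monotone, hence not injective."""
    return x + branch(x)


def find_preimage(y: float, iterations: int = 200) -> float:
    """Return some x with toy_block(x) approximately equal to y.

    With r = |y| + BRANCH_BOUND we have toy_block(-r) <= y <= toy_block(r),
    so by continuity a solution lies in [-r, r] and bisection brackets it.
    """
    r = abs(y) + BRANCH_BOUND
    lo, hi = -r, r  # invariant: toy_block(lo) <= y <= toy_block(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if toy_block(mid) <= y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    for target in (-7.5, 0.0, 4.2):
        x = find_preimage(target)
        print(f"target={target:+.3f}  x={x:+.6f}  residual={toy_block(x) - target:+.2e}")
```

Every target has at least one preimage here because the bounded branch cannot outrun the identity term. The map is surjective without being invertible in the usual sense: several inputs can reach the same output, and the search returns only one of them.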

Paper Structure

This paper contains 31 sections, 18 theorems, 30 equations, 1 table, 1 algorithm.

Key Result

Theorem 2.1

Let $B^d(R)=\left\{x\in\mathbb{R}^d\middle|\|x\|\leq R\right\}$ be a $d$-dimensional ball with radius $R$. For every continuous function $f:B^d(R)\to B^d(R)$, there exists $x\in B^d(R)$ such that $f(x)=x$.
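In one dimension the ball $B^1(R)$ is the interval $[-R,R]$, and the statement reduces to the intermediate value theorem applied to $f(x)-x$. The sketch below illustrates that special case numerically. The helper `fixed_point_1d` and the choice $f=\cos$ with $R=1$ are assumptions made for illustration and are not taken from the paper; the bisection argument does not carry over to $d>1$.

```python
import math
from typing import Callable


def fixed_point_1d(f: Callable[[float], float], radius: float,
                   iterations: int = 200) -> float:
    """Approximate a fixed point of a continuous f mapping [-radius, radius] into itself.

    g(x) = f(x) - x satisfies g(-radius) >= 0 and g(radius) <= 0 because f
    maps the interval into itself, so bisection on the sign of g brackets a root.
    """
    lo, hi = -radius, radius  # invariant: f(lo) - lo >= 0 and f(hi) - hi <= 0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) - mid >= 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# cos maps [-1, 1] into [cos(1), 1], which is a subset of [-1, 1].
x_star = fixed_point_1d(math.cos, radius=1.0)
```

Here `x_star` satisfies `math.cos(x_star) == x_star` up to floating-point error. The theorem only asserts existence, so in general the fixed point need not be unique and simple iteration of $f$ need not converge to it.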

Theorems & Definitions (39)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 2.1: Brouwer's Fixed Point Theorem
  • Definition 5
  • Theorem 2.2: Inverse Function Theorem
  • Definition 6
  • Definition 7
  • Lemma 1
  • ...and 29 more