Scaling Laws for Adversarial Attacks on Language Model Activations

Stanislav Fort

TL;DR

The paper investigates adversarial attacks that perturb language model activations (residual streams) to deterministically steer the next $t$ tokens. It formalizes a linear scaling law, $t_\mathrm{max} = \kappa a$, linking attack length $a$ to controllable output length, and extends it to fractional and multi-attack settings through $t_\mathrm{max} = \kappa f a / n$. A geometric interpretation shows vulnerability arises from input-output dimensionality mismatch, with attack resistance $\chi = \frac{d p}{\kappa \log_2 V}$ remaining roughly constant across model scales, and activation attacks proving substantially stronger than token substitutions. The findings have practical implications for multi-modal and retrieval-heavy systems, revealing a broad activation-based attack surface and guiding defense by balancing activation dimensionality and output-space complexity. Overall, the work provides both empirical scaling laws and a unifying dimensionality perspective on adversarial vulnerability in language models.
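To make the scaling law concrete, here is a minimal Python sketch (not the paper's code) that evaluates $t_\mathrm{max} = \kappa f a / n$. The function name `max_target_tokens` is hypothetical. It assumes that $f$ is the fraction of each attacked activation vector the attacker controls and that $n$ is the number of context and target pairs a single attack must succeed on; the value $\kappa = 119$ is taken from the Figure 5 caption for EleutherAI/pythia-1.4b-v0.

```python
def max_target_tokens(kappa: float, a: int, f: float = 1.0, n: int = 1) -> float:
    """Evaluate the scaling law t_max = kappa * f * a / n.

    kappa: attack multiplier (target tokens controlled per attacked token)
    a:     number of tokens whose activations the attacker controls
    f:     assumed meaning: fraction of each activation vector under attack
    n:     assumed meaning: number of context/target pairs the attack must fit
    """
    if a < 0 or n < 1 or not (0.0 < f <= 1.0):
        raise ValueError("need a >= 0, n >= 1 and 0 < f <= 1")
    return kappa * f * a / n


# kappa is the estimate quoted in the Figure 5 caption; a, f, n are illustrative.
kappa = 119.0
for a in (1, 2, 4, 8):
    print(f"a={a}: t_max={max_target_tokens(kappa, a):.0f}")

# Halving the controlled fraction and doubling the number of pairs
# each halve the predicted t_max under this law.
print(max_target_tokens(kappa, a=8, f=0.5, n=2))
```

The law is linear in $a$ and $f$ and inversely proportional to $n$, so the sketch is mostly useful for quick sanity checks of how far a given attack budget could reach under these assumptions.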

Abstract

We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar number of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
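The attack resistance can be read as a ratio of bit counts, which the following Python sketch works through. It is an illustration under stated assumptions, not the paper's code: the function name `attack_resistance` is hypothetical, the input is assumed to carry $a \cdot d \cdot p$ bits (with $d$ the activation dimension and $p$ the bits of precision per dimension), and the output $t \log_2 V$ bits (with $V$ the vocabulary size). The values $d = 2048$, $p = 16$ and $V = 50304$ are assumed settings for a Pythia-1.4b-like model, and $\kappa = 119$ comes from the Figure 5 caption.

```python
import math


def attack_resistance(a: int, t: float, d: int, p: int, vocab_size: int) -> float:
    """Input bits controlled per output bit controlled.

    Assumes input bits = a * d * p and output bits = t * log2(vocab_size).
    """
    input_bits = a * d * p
    output_bits = t * math.log2(vocab_size)
    return input_bits / output_bits


# Assumed, illustrative model settings (not stated in this summary).
d, p, vocab_size = 2048, 16, 50304
kappa = 119.0  # attack multiplier quoted in the Figure 5 caption

a = 4
t = kappa * a  # at the scaling-law limit t_max = kappa * a
chi = attack_resistance(a, t, d, p, vocab_size)

# With t = kappa * a the attack length cancels, leaving d * p / (kappa * log2 V).
chi_closed_form = d * p / (kappa * math.log2(vocab_size))
print(chi, chi_closed_form)
```

Because $a$ cancels once $t = \kappa a$ is substituted, $\chi$ depends only on the model's dimensions and on $\kappa$, which is why it can be compared across models of different sizes.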

Paper Structure (22 sections, 11 equations, 16 figures, 2 tables, 2 algorithms)

Introduction
Theory
Problem setup
Attacking activation vectors
Input and output space dimensions
Scaling laws
Comparison to token-level substitution attacks
Method
Problem setup
An attack on activations
Loss evaluation and optimization
Estimating the attack multiplier $\kappa$
Token substitution attacks
Attack and target separation within the context
Results and Discussion
...and 7 more sections

Figures (16)

Figure 1: (Left panel) A diagram showing an attack on the activations (blue vectors) of a language model that leads to the change of the predicted next token from species to friend. (Right panel) The maximum number of tokens whose values can be set precisely, $t_\mathrm{max}$, scales linearly with the number of attack tokens $a$.
Figure 2: The difference between having no more classes than attack dimensions (on the left) and having more classes than dimensions (on the right). In the former case, neighboring cells of all different classes are common, allowing for easy-to-find adversarial attacks.
Figure 3: An illustration of the space of activations being partitioned into regions that get mapped to different $t$-token output sequences $\approx$ our output classes.
Figure 4: A diagram showing the $t=3$ multi-token target prediction after an attack on $a=2$ token activations.
Figure 5: A summary of adversarial attacks on activations of EleutherAI/pythia-1.4b-v0. Only experiments varying the attack length $a$ (in tokens whose activations the attacker controls) and the multiplicity of context and target pairs the attack has to succeed on, $n$, are shown. The estimated attack multiplier is $\kappa = 119.0 \pm 2.9$ which means that controlling a single token worth of activations on the input allows the attacker to determine $\approx119$ tokens on the output.
...and 11 more figures
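As an illustration of how a multiplier such as the $\kappa = 119.0 \pm 2.9$ quoted for Figure 5 could be obtained from measurements, here is a minimal Python sketch. It uses an ordinary least-squares fit through the origin of $t_\mathrm{max}$ against $a / n$, which is a stand-in technique and not necessarily the paper's estimation procedure. The function name `fit_kappa` is hypothetical, and the data points are synthetic, generated around a slope of 100 purely for demonstration.

```python
import math


def fit_kappa(xs, ts):
    """Least-squares slope through the origin for t ~ kappa * x.

    Returns (kappa, standard_error). Needs at least two points.
    """
    if len(xs) != len(ts) or len(xs) < 2:
        raise ValueError("need two or more paired observations")
    sxx = sum(x * x for x in xs)
    sxy = sum(x * t for x, t in zip(xs, ts))
    kappa = sxy / sxx
    # Residual variance with one fitted parameter (the slope).
    residual_ss = sum((t - kappa * x) ** 2 for x, t in zip(xs, ts))
    sigma2 = residual_ss / (len(xs) - 1)
    return kappa, math.sqrt(sigma2 / sxx)


# Synthetic placeholder runs: (attack length a, multiplicity n, measured t_max).
# These numbers are made up for the example and are not results from the paper.
runs = [(1, 1, 103), (2, 1, 196), (4, 1, 405), (8, 1, 798), (4, 2, 199), (8, 4, 202)]
xs = [a / n for a, n, _ in runs]
ts = [t for _, _, t in runs]

kappa, kappa_err = fit_kappa(xs, ts)
print(f"kappa = {kappa:.1f} +/- {kappa_err:.1f}")
```

Forcing the fit through the origin encodes the assumption that zero attacked tokens control zero target tokens; a fit with a free intercept would be one way to check that assumption against real measurements.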
