Scaling Laws for Adversarial Attacks on Language Model Activations
Stanislav Fort
TL;DR
The paper investigates adversarial attacks that perturb language model activations (residual streams) to deterministically steer the next $t$ tokens. It formalizes a linear scaling law, $t_\mathrm{max} = \kappa a$, linking attack length $a$ to controllable output length, and extends it to fractional and multi-attack settings through $t_\mathrm{max} = \kappa f a / n$. A geometric interpretation shows vulnerability arises from input-output dimensionality mismatch, with attack resistance $\chi = \frac{d p}{\kappa \log_2 V}$ remaining roughly constant across model scales, and activation attacks proving substantially stronger than token substitutions. The findings have practical implications for multi-modal and retrieval-heavy systems, revealing a broad activation-based attack surface and guiding defense by balancing activation dimensionality and output-space complexity. Overall, the work provides both empirical scaling laws and a unifying dimensionality perspective on adversarial vulnerability in language models.
Abstract
We explore a class of adversarial attacks targeting the activations of language models. By manipulating a relatively small subset of model activations, $a$, we demonstrate the ability to control the exact prediction of a significant number (in some cases up to 1000) of subsequent tokens $t$. We empirically verify a scaling law where the maximum number of target tokens $t_\mathrm{max}$ predicted depends linearly on the number of tokens $a$ whose activations the attacker controls as $t_\mathrm{max} = \kappa a$. We find that the number of bits of control in the input space needed to control a single bit in the output space (what we call attack resistance $\chi$) is remarkably constant between $\approx 16$ and $\approx 25$ over 2 orders of magnitude of model sizes for different language models. Compared to attacks on tokens, attacks on activations are predictably much stronger; however, we identify a surprising regularity where one bit of input steered either via activations or via tokens is able to exert control over a similar number of output bits. This gives support for the hypothesis that adversarial attacks are a consequence of dimensionality mismatch between the input and output spaces. A practical implication of the ease of attacking language model activations instead of tokens is for multi-modal and selected retrieval models, where additional data sources are added as activations directly, sidestepping the tokenized input. This opens up a new, broad attack surface. By using language models as a controllable test-bed to study adversarial attacks, we were able to experiment with input-output dimensions that are inaccessible in computer vision, especially where the output dimension dominates.
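As a rough illustration of the bookkeeping behind these quantities, the sketch below evaluates the scaling law $t_\mathrm{max} = \kappa f a / n$ and the attack resistance $\chi = \frac{d p}{\kappa \log_2 V}$. It assumes that $d$ is the activation dimension per attacked token, $p$ the number of bits of precision per activation value, $V$ the vocabulary size, $f$ the fraction of activation dimensions the attacker controls, and $n$ the number of simultaneous attacks. The function names and all numeric values are hypothetical and chosen for clean arithmetic; they are not measurements from the paper.

```python
import math


def max_target_tokens(kappa, a, f=1.0, n=1):
    """Scaling law t_max = kappa * f * a / n.

    kappa: tokens of output controlled per fully attacked token (model-specific slope)
    a:     number of tokens whose activations are attacked
    f:     fraction of activation dimensions under attacker control (assumed meaning)
    n:     number of simultaneous attacks (assumed meaning)
    """
    return kappa * f * a / n


def attack_resistance(d, p, kappa, vocab_size):
    """chi = (d * p) / (kappa * log2(V)): input bits needed per output bit."""
    input_bits = d * p                            # bits in one attacked activation vector
    output_bits = kappa * math.log2(vocab_size)   # bits in the kappa tokens it controls
    return input_bits / output_bits


# Hypothetical, illustrative values only (not taken from the paper).
d, p, V, kappa = 512, 16, 65536, 32

t_full = max_target_tokens(kappa, a=4)                  # full attack on 4 tokens
t_partial = max_target_tokens(kappa, a=4, f=0.5, n=2)   # half the dimensions, 2 attacks
chi = attack_resistance(d, p, kappa, V)

print(t_full, t_partial, chi)
```

With these made-up inputs, one attacked token carries $512 \times 16 = 8192$ input bits and controls $32 \times 16 = 512$ output bits, so $\chi = 16$, which happens to sit at the lower end of the range quoted above.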
