Perturbation: A simple and efficient adversarial tracer for representation learning in language models

Joshua Rozner; Cory Shain

Abstract

Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.
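The perturb-then-measure idea can be illustrated with a toy model. The sketch below is not the authors' implementation: it swaps the LM for a three-weight logistic model, and the item names, feature vectors, learning rate, and step count are all hypothetical. It only shows the shape of the procedure: take a few gradient steps on one example, then record how much the log-odds of every other example moved.

```python
import math

def log_odds(w, x):
    # For a logistic model, the log-odds of the target label is the linear score.
    return sum(wi * xi for wi, xi in zip(w, x))

def perturb(w, x, target, lr=0.5, steps=5):
    """Return a copy of w after a few gradient steps on the single example (x, target)."""
    w = list(w)
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-log_odds(w, x)))
        # Gradient of the logistic log-likelihood with respect to w is (target - p) * x.
        w = [wi + lr * (target - p) * xi for wi, xi in zip(w, x)]
    return w

def transfer(w, x_train, x_eval, target=1.0):
    """Change in log-odds on x_eval caused by perturbing the model on x_train."""
    w_new = perturb(w, x_train, target)
    return log_odds(w_new, x_eval) - log_odds(w, x_eval)

# Hypothetical items: a1 and a2 share the first feature; b1 shares nothing with them.
items = {
    "a1": [1.0, 0.0, 1.0],
    "a2": [1.0, 0.0, 0.0],
    "b1": [0.0, 1.0, 0.0],
}
w0 = [0.0, 0.0, 0.0]

# matrix[i][j]: perturb on item i, evaluate on item j.
matrix = {
    i: {j: transfer(w0, xi, xj) for j, xj in items.items()}
    for i, xi in items.items()
}
```

In this toy setting, transfer from `a1` to `a2` is positive because they share a feature, while transfer from `a1` to `b1` is zero. In a real LM the shared structure is not given in advance, which is what the transfer pattern is meant to reveal.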

Paper Structure (50 sections, 3 equations, 17 figures, 9 tables)

Introduction
Methods
Perturbation
Remapping
Perturbation
Evaluation
Interpretation
Experimental Design
Tasks and Benchmarks
Implementation
Evaluation
Morphological Representations
Lexical Representations
Syntactic Representations
Discussion
...and 35 more sections

Figures (17)

Figure 1: A, B: Transfer for Morphology task in RoBERTa-large. A is randomly initialized; B is the trained model. Cell $i, j$ corresponds to fine-tuning on the $i$th remapping and evaluating on the $j$th. Items are grouped hierarchically by morphological class (outer, deverbal vs. comparative) and tokenization (inner, e.g., Sing. means singly tokenized, -er means the final token is -er, etc.). C: Distribution of transfer effects (log-odds) to deverbal (magenta) vs. comparative (cyan) depending on which class is perturbed ($x$-axis), with the median indicated in each distribution.
Figure 2: Similarity matrices for square (WSD) using perturbation (left) and cosine (right). Block structure corresponds to four senses: shape, company, town square, numerical square (e.g., 4, 9). AUCs: 0.94, 0.87.
Figure 3: Transfer (aggregated over animacy and embeddedness conditions) from each FG and control to FG, (minimal) non-FG conditions, and controls. FG and non-FG exhibit similar transfer through step 1k, after which transfer to non-FG begins decreasing. Once FG is learned (steps 2k-5k), minimal transfer is seen to/from controls.
Figure 4: A: Transfer between indicated classes over training. Standard errors are too small to be seen. B: Transfer between indicated classes in untrained (magenta) and trained (cyan) models, with medians shown.
Figure A1: Results for all models, with untrained on left, trained in middle, and effect distribution (log-odds) in trained models on right. Effect sizes in the heatmaps are clipped to 7, 3, 3, 3, 3, respectively (effect sizes vary based on learning rate, which was tuned per model), so that the diagonal does not hide the rest of the heatmap structure. The final row (rob-large) is the same as is shown in the main text.
...and 12 more figures
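The Figure 2 caption reports AUCs for how well each similarity matrix recovers the sense blocks. The source does not say how those AUCs are computed, so the following is only one plausible reading, offered as an assumption: treat every pair of items as a binary case (same sense vs. different sense) and compute the probability that a same-sense pair has higher similarity than a different-sense pair. The function name, labels, and matrix values are hypothetical.

```python
from itertools import combinations

def pairwise_auc(sim, labels):
    """AUC for separating same-label pairs from different-label pairs by similarity.

    Uses only the upper triangle of `sim`; an asymmetric matrix would need
    symmetrizing or both triangles, which this sketch does not attempt.
    """
    same, diff = [], []
    for i, j in combinations(range(len(labels)), 2):
        (same if labels[i] == labels[j] else diff).append(sim[i][j])
    if not same or not diff:
        raise ValueError("need both same-label and different-label pairs")
    # Count wins of same-label pairs over different-label pairs; ties count half.
    wins = sum((s > d) + 0.5 * (s == d) for s in same for d in diff)
    return wins / (len(same) * len(diff))

# Hypothetical 4-item example with two senses and clear block structure.
labels = ["shape", "shape", "company", "company"]
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
auc = pairwise_auc(sim, labels)
```

A value near 1 means the matrix cleanly separates the blocks, and a value near 0.5 means it carries no sense information under this pairwise definition.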
