Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis

TL;DR

Infusion shows that model behavior can be steered by making small, influence-guided perturbations to training documents. The framework identifies influential training data, computes gradient-based perturbations, and validates their effect via partial retraining. On CIFAR-10 it achieves reliable target-behavior shifts while editing as little as $0.2\%$ ($100/45{,}000$) of the training data, and it demonstrates cross-architecture transfer as well as behavior amplification on structured tasks such as a Caesar cipher. In transformer and language-model settings, Infusion yields measurable likelihood shifts but tends to struggle to flip predictions at current scales, which suggests the threat is scale-dependent. These findings underscore the importance of data provenance and influence-based defenses for mitigating training-time threats, and they expand the landscape of data-poisoning techniques.

Abstract

Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet $\leftrightarrow$ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.
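
For orientation, the standard first-order influence-function approximation for a perturbed training example (following Koh and Liang, 2017) can be written as below. This is a sketch of the general idea rather than the paper's exact formulation: $H_{\hat{\theta}}$ (the Hessian of the training loss at $\hat{\theta}$) and $n$ (the number of training examples) are notation assumed here, while $z$, $\delta$, $\hat{\theta}$, and the measurement $f$ follow the Figure 1 caption. Scalable variants typically replace $H_{\hat{\theta}}^{-1}$ with an approximation such as EK-FAC.

```latex
% Predicted parameter shift from replacing training example z with z + \delta
\Delta\theta \;\approx\; -\frac{1}{n}\, H_{\hat{\theta}}^{-1}
  \left[ \nabla_\theta L(z + \delta, \hat{\theta}) - \nabla_\theta L(z, \hat{\theta}) \right]

% Predicted change in the measurement f caused by that shift
\Delta f \;\approx\; \nabla_\theta f(\hat{\theta})^{\top} \Delta\theta
```

Under this approximation, crafting a perturbation amounts to choosing a small $\delta$ that makes the predicted $\Delta f$ as large as possible.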

Paper Structure

This paper contains 38 sections, 22 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: The Infusion pipeline. Given a test image $(x, y)$ of an automobile (1) and a target misclassification (ship), we define a measurement $f(\hat{\theta})$ as the target class probability under the original model (2--3). Using EK-FAC influence estimation, we identify the $k$ training examples most influential for this measurement (4). We then compute perturbations $\delta$ via projected gradient descent that maximize the predicted change in $f$ (5), yielding infused training examples $z + \delta$ (6). Retraining for one epoch on the modified corpus (7) produces a new model with shifted loss landscape $L(\theta^*)$ (8), where the target class probability has increased substantially on our test image whilst keeping all other model behavior nearly unchanged (9). Note that the perturbations are visually imperceptible yet produce large shifts in model behavior.
  • Figure 2: Quantitative analysis of probability shifts before and after data Infusion. Left: Heatmap of mean $\Delta P(\text{target})$ for each (true class, target class) pair, averaged over 20 test images per cell. Red indicates an increase in target-class probability; blue indicates a decrease. Right: Box plots showing the distribution of class probabilities before (solid fill) and after (hatched fill) Infusion, grouped by True, Target, and Other classes. Whiskers span the 5th--95th percentiles; diamonds indicate the mean.
  • Figure 3: Comparison of $\Delta p$ (target class probability change) across baseline methods. Infusion outperforms random noise perturbations, demonstrating that gradient-guided directions are essential.
  • Figure 4: Classwise transfer of Infusion perturbations. Each heatmap shows the best $\Delta p$ (change in target-class probability after retraining on infused data) across all (true label, target class) pairs, for each of the four source--evaluator combinations. Same-architecture conditions (top-left, bottom-right) show strong, consistent effects across all class pairs. Cross-architecture transfer is weaker but non-zero: notably, CNN$\to$ResNet (bottom-left) shows positive transfer across some class pairs, suggesting that CNN-computed perturbations capture features that generalize to residual architectures.
  • Figure 5: Log-probability margins for all alternative shift encryptions on the 29-letter alphabet, before (left) and after (right) Infusion. Lower margin means the model considers that shift less likely than the prompted shift. The target shift (9, red) improves most, but the model remains confident in the correct answer (16, green, at $\sim y=0$---circled at the top of each figure).
  • ...and 9 more figures
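
The pipeline in Figure 1 can be illustrated on a toy problem. The sketch below is not the authors' implementation: it swaps the image classifier for a small L2-regularised logistic regression, uses the exact Hessian instead of EK-FAC, and refits the model fully instead of retraining for one epoch. All function names, the synthetic data, and the hyperparameters (`eps`, `alpha`, `k`, `lam`) are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def hessian(X, theta, lam):
    """Exact Hessian of the mean logistic loss plus L2 term (stand-in for EK-FAC)."""
    p = sigmoid(X @ theta)
    return (X.T * (p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])

def fit_logreg(X, y, lam, steps=50):
    """Newton's method for L2-regularised logistic regression (toy model)."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(X) + lam * theta
        theta -= np.linalg.solve(hessian(X, theta, lam), grad)
    return theta

def predicted_delta_f(x, y, delta, theta, v, n):
    """First-order predicted change in f when x is replaced by x + delta."""
    g_new = (sigmoid(theta @ (x + delta)) - y) * (x + delta)
    g_old = (sigmoid(theta @ x) - y) * x
    return -(v @ (g_new - g_old)) / n

def pgd_perturb(x, y, theta, v, n, eps=0.1, alpha=0.02, iters=20):
    """Projected gradient ascent on the predicted change, inside an L-inf ball."""
    delta = np.zeros_like(x)
    best, best_val = delta.copy(), 0.0  # delta = 0 predicts exactly zero change
    for _ in range(iters):
        xp = x + delta
        s = sigmoid(theta @ xp)
        # Gradient of predicted_delta_f with respect to delta.
        grad = -(s * (1 - s) * (v @ xp) * theta + (s - y) * v) / n
        delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
        val = predicted_delta_f(x, y, delta, theta, v, n)
        if val > best_val:
            best, best_val = delta.copy(), val
    return best, best_val

# Synthetic two-class data (hypothetical; stands in for the training corpus).
rng = np.random.default_rng(0)
n, d, lam, k = 200, 5, 1e-2, 10
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, d))
X[:, 0] += (2 * y - 1) * 0.5

theta = fit_logreg(X, y, lam)
x_test = rng.normal(size=d)

# Measurement f: probability of the target class (class 1) on the test point.
f_before = sigmoid(theta @ x_test)
grad_f = f_before * (1 - f_before) * x_test
v = np.linalg.solve(hessian(X, theta, lam), grad_f)  # H^{-1} grad f

# Influence of each training example on f; keep the k largest in magnitude.
per_example_grads = (sigmoid(X @ theta) - y)[:, None] * X
influence = -per_example_grads @ v
top_k = np.argsort(-np.abs(influence))[:k]

# Perturb only the selected examples, then refit on the modified data.
X_infused = X.copy()
predicted = 0.0
for i in top_k:
    delta, val = pgd_perturb(X[i], y[i], theta, v, n)
    X_infused[i] += delta
    predicted += val

theta_new = fit_logreg(X_infused, y, lam)
f_after = sigmoid(theta_new @ x_test)
print(f"f before: {f_before:.4f}, f after: {f_after:.4f}, predicted change: {predicted:.4f}")
```

Comparing `f_after - f_before` with `predicted` gives a rough check of how well the first-order estimate tracks the effect of actually refitting. In this convex toy setting the two are expected to agree more closely than in the deep networks the paper studies.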