LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss

Pau Rodriguez, Michal Klein, Eleonora Gualdoni, Valentino Maiorca, Arno Blaas, Luca Zappella, Marco Cuturi, Xavier Suau

TL;DR

LinEAS performs end-to-end activation steering by learning affine transport maps across multiple layers under a global distributional loss based on sliced Wasserstein distances, using unpaired data. Sparsity regularization enables automatic layer and neuron selection, and the resulting objective is optimized with proximal SGD. Empirically, LinEAS delivers robust toxicity mitigation in LLMs and effective concept steering in text-to-image generation with minimal utility loss, outperforming several weakly-supervised baselines and approaching oracle-based methods in low-data regimes. The approach is modality-agnostic, computationally efficient, and supports compositional steering, which makes it practical when data is scarce and flexibility is desired.

Abstract

The growing use of generative models in daily life calls for efficient mechanisms to control their generation, e.g., to produce safe content or to provide users with tools to explore style changes. Ideally, such mechanisms should require a low volume of unpaired data (i.e., without explicit preference), and should be cheap, both at training and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e.g., toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron selection. LinEAS only requires a handful of unpaired samples to be effective, and beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic and we empirically find that it outperforms existing activation steering methods at mitigating and including new concepts at the output of single-step text-to-image generation models.
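
As a rough illustration of the kind of distributional loss described above, the following sketch applies a coordinate-wise affine map to source activations at one layer and compares each neuron's mapped distribution with the target distribution through a 1D Wasserstein distance. It is a minimal sketch under stated assumptions, not the authors' implementation: the squared 2-Wasserstein cost, the equal sample sizes, the averaging over neurons, and the names `wasserstein_1d`, `affine_map`, and `layer_loss` are all choices made here for illustration.

```python
def wasserstein_1d(u, v):
    """Squared 2-Wasserstein distance between two equal-size 1D samples.

    For 1D empirical distributions with the same number of points, the
    optimal coupling matches sorted values, so no pairing is required.
    """
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal sample sizes")
    us, vs = sorted(u), sorted(v)
    return sum((a - b) ** 2 for a, b in zip(us, vs)) / len(us)


def affine_map(acts, scale, shift):
    """Apply a coordinate-wise affine map to a list of activation vectors."""
    return [[s * a + b for a, s, b in zip(row, scale, shift)] for row in acts]


def layer_loss(src_acts, tgt_acts, scale, shift):
    """Average per-neuron 1D Wasserstein distance after mapping the source."""
    mapped = affine_map(src_acts, scale, shift)
    d = len(scale)
    total = 0.0
    for j in range(d):
        mapped_j = [row[j] for row in mapped]
        target_j = [row[j] for row in tgt_acts]
        total += wasserstein_1d(mapped_j, target_j)
    return total / d


# Two unpaired sets of two samples each, with d = 2 neurons (toy numbers).
src = [[0.0, 1.0], [2.0, 3.0]]
tgt = [[5.0, 2.0], [1.0, 6.0]]

identity_loss = layer_loss(src, tgt, scale=[1.0, 1.0], shift=[0.0, 0.0])
fitted_loss = layer_loss(src, tgt, scale=[2.0, 2.0], shift=[1.0, 0.0])
```

In the end-to-end setting the abstract describes, a term of this kind would be summed over all intervened layers and the maps trained jointly, so that each map accounts for the maps upstream of it; the sketch only covers a single layer.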

Paper Structure

This paper contains 37 sections, 10 equations, 19 figures, 16 tables, 1 algorithm.

Figures (19)

  • Figure 1: LinEAS learns lightweight maps to steer pretrained model activations. With LinEAS, we gain fine-grained control on text-to-image generation to induce precise styles (in the figure) or remove objects (e.g., \ref{sec:results_diffusion}). The same procedure also allows controlling LLMs (e.g., \ref{sec:tox}).
  • Figure 2: Given a frozen computational graph (blue) of $L+1$ layers of interest, we interlace it with $L$ transport blocks (red). Each transport is defined as a collection of coordinate-wise affine transformations, as displayed in the 3 and 2 boxes for maps $\textcolor{red}{T_{1}}$ and $\textcolor{red}{T_{2}}$ respectively. All transport maps are jointly trained to minimize a sum of distributional losses $\Delta$ between the neural activation distributions collected from samples ${\bm{x}}_1,\dots,{\bm{x}}_n \sim p$ (one shade of grey per sample) and ${\bm{y}}_1,\dots,{\bm{y}}_n\sim q$ (resp. yellow). We learn the parameters of these maps jointly by minimizing the penalized sum of $\Delta$ terms, where $\Delta$ is a 1D Wasserstein distance evaluated on the $d_\ell$ activations of layer $\ell$. Using a global optimization, we can consider sparsifying regularizers (included in $\mathcal{R}$), e.g., to select a sparse subset of activations that require interventions. For instance, when adding a regularizer that promotes sparsity, $\textcolor{red}{T_{1}}$ and $\textcolor{red}{T_{2}}$ each leave one neuron untouched: the first and the second, respectively.
  • Figure 2: LinEAS mitigates concepts on DMD2 [yin2024dmd2] while staying perceptually similar to the original image. Users prefer LinEAS 63.3% of the time (left) since it maintains a higher fidelity to the non-intervened original model when using the same prompt (center), and matches other methods at concept removal (right). Results were obtained with $\lambda=1$ and are aggregated across all concepts.
  • Figure 3: LinEAS is effective in the low-data regime. We study toxicity mitigation (two left-most plots) and utility (two right-most plots) as a function of the amount of data available to learn interventions. LinEAS shows better performance (low toxicity and utility close to the original, shown as dashed lines) when little data is available, and stable performance for $N\geq 32$.
  • Figure 4: Sparsity improves utility while mitigating toxicity. Toxicity results on Qwen2.5-7B using only 32 sentences, at different levels of sparsity $\gamma$ that result in different support sizes (x axis). At 1K optimization steps, with a support of about 1% we maintain similar toxicity (left, center-left) while PPL$_{\text{WIK}}$ decreases (center-right) and MMLU increases (right). Note that overly long optimizations (10K steps) might harm utility due to overfitting. Similarly, short optimizations (e.g., 100 steps) combined with strong sparsity lead to low conditioning (mild toxicity mitigation).
  • ...and 14 more figures
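
The captions above refer to sparsifying regularizers and to a sparsity level $\gamma$ that controls how many activations receive an intervention. One common way such a penalty is handled in proximal SGD is a gradient step followed by soft-thresholding, sketched below. This is an illustration under assumptions rather than the paper's algorithm: it assumes a plain $\ell_1$ penalty, assumes each parameter is stored as a deviation from the identity map (so that exactly zero means "no intervention on this neuron"), and uses hypothetical names `soft_threshold`, `proximal_step`, and `support_size`.

```python
def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t, clipping at 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def proximal_step(params, grads, lr, gamma):
    """One proximal SGD step for a smooth loss plus gamma * ||params||_1.

    A plain gradient step on the smooth part is followed by soft-thresholding
    with threshold lr * gamma, which sets small parameters exactly to zero.
    """
    return [soft_threshold(p - lr * g, lr * gamma) for p, g in zip(params, grads)]


def support_size(params):
    """Number of parameters that still intervene (are non-zero)."""
    return sum(1 for p in params if p != 0.0)


# Toy deviations from the identity map for three neurons, with made-up gradients.
params = [0.5, -0.02, 0.0]
grads = [1.0, 0.1, -0.3]

updated = proximal_step(params, grads, lr=0.1, gamma=0.5)
active = support_size(updated)
```

With these toy numbers only the first neuron keeps a non-zero parameter after the step; raising `gamma` shrinks the support further, which is the qualitative trade-off the Figure 4 caption describes between sparsity and conditioning strength.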