Controlling Language and Diffusion Models by Transporting Activations

Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, Marco Cuturi, Xavier Suau

TL;DR

This paper introduces Activation Transport (AcT), a general framework for steering activations that is guided by optimal transport theory. AcT generalizes many previous activation-steering works and provides fine-grained control over model behavior with negligible computational overhead and minimal impact on model abilities.

Abstract

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.

Paper Structure

This paper contains 45 sections, 1 theorem, 7 equations, 32 figures, 16 tables.

Key Result

Proposition 3.1

(Santambrogio, 2015) Let $\rho,\tau\in \mathcal{P}(\mathbb{R})$ be two univariate distributions. For any submodular cost $c:\mathbb{R}\times \mathbb{R}\rightarrow\mathbb{R}$ (i.e., such that $\partial^2 c/\partial x\,\partial y<0$), the optimal transport map $T$ that transports $\rho$ to $\tau$ is
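The closing formula of the proposition is not reproduced in this listing. For orientation, the classical univariate result is that the optimal map for such costs (the quadratic cost $c(x,y)=(x-y)^2$ is one example, since its mixed derivative is $-2$) is monotone non-decreasing and composes the CDF of $\rho$ with the quantile function of $\tau$. The sketch below estimates that map from samples, on the assumption that this is the form the proposition takes. The helper name `empirical_ot_map`, the mid-point probability levels, and the Gaussian toy data are illustrative choices, not taken from the paper.

```python
import numpy as np

def empirical_ot_map(src_samples, tgt_samples):
    """Return T(x) ~ Q_tau(F_rho(x)) estimated from samples (hypothetical helper)."""
    src_sorted = np.sort(np.asarray(src_samples, dtype=float))
    tgt_sorted = np.sort(np.asarray(tgt_samples, dtype=float))
    # Probability level assigned to each sorted sample (mid-point convention).
    src_levels = (np.arange(src_sorted.size) + 0.5) / src_sorted.size
    tgt_levels = (np.arange(tgt_sorted.size) + 0.5) / tgt_sorted.size

    def transport(x):
        # F_rho(x): piecewise-linear empirical CDF of the source samples.
        u = np.interp(x, src_sorted, src_levels)
        # Q_tau(u): piecewise-linear empirical quantile function of the target.
        return np.interp(u, tgt_levels, tgt_sorted)

    return transport

# Toy data (assumed for illustration): two Gaussians with different scales.
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=1000)
b = rng.normal(2.0, 0.5, size=1000)

T = empirical_ot_map(a, b)
mapped = T(a)  # the i-th smallest source sample lands on the i-th smallest target sample
```

With equal sample counts this reduces to matching sorted samples, which is why the map is monotone by construction.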

Figures (32)

  • Figure 1: Linear-AcT unlocks interpretable controllability for both LLMs and Diffusion, offering explicit control over the strength of conditioning, via a parameter $\lambda$ between 0 (no transport) and 1 (full transport).
  • Figure 2: Transport maps using different methods. For distributions with $\sigma_a = \sigma_b$ (left), all methods except ActAdd are equivalent. When $\sigma_a \neq \sigma_b$ (right), vector-based methods (e.g., ActAdd, ITI-c, Mean-AcT) diverge from the map defined by the samples. ActAdd shows a bias since it only uses one sample pair. The linear estimator is robust to differences in $\sigma$.
  • Figure 3: Actual $\sigma_a,\sigma_b$ for toxic and non-toxic sentences on Gemma2-2B, showing that $\sigma_a \neq \sigma_b$ in real scenarios.
  • Figure 4: Concept induction using AcT (post-LN layers) and ITI-c (attention layers) on Gemma2-2B. We aggregate results over 7 WordNet concepts, generating 500 sentences at different intervention strength levels. We report concept presence with LLM-as-a-judge ($p(yes)$), and the PPL of the generated sentences using Mistral-7B. We plot the median (and 25/75 quantile band) across concepts and generations per level, showing that Linear-AcT achieves a peak of concept induction at $\lambda\approx1$, which is in line with our OT formulation. Other methods show different maxima.
  • Figure 5: Linear-AcT allows controlled conditioning of SDXL and FLUX. Prompt: "A cat resting on a laptop keyboard in a bedroom." SDXL (left) and FLUX (right) intervened with ITI-c (top), Mean-AcT (middle) and Linear-AcT (bottom) for the concept cyberpunk, with a $\lambda$ strength in $[0,1]$. The image with the best $\lambda$ (according to the highest 0-shot score in the CLIP score figure) is shown on the right. Qualitatively, Linear-AcT better balances increasing the cyberpunk style with preserving the prompt semantics.
  • ...and 27 more figures
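
The captions of Figures 1 and 2 describe two ideas that a small numerical sketch can make concrete: a strength $\lambda$ that moves between no transport and full transport, and the difference between a pure mean shift and a map that also rescales when $\sigma_a \neq \sigma_b$. The code below assumes that $\lambda$ interpolates linearly between the identity and the transported activation, and it uses the closed-form affine map between two Gaussians as a stand-in for a linear transport map. The function names and the toy data are hypothetical.

```python
import numpy as np

def mean_shift_map(a, m_a, m_b):
    """Vector-style steering: translate by the difference of means only."""
    return a + (m_b - m_a)

def affine_map(a, m_a, s_a, m_b, s_b):
    """Affine map matching both mean and standard deviation (Gaussian case)."""
    return m_b + (s_b / s_a) * (a - m_a)

def steer(a, transport, lam):
    """Assumed interpolation: lam=0 leaves `a` unchanged, lam=1 applies the full map."""
    return (1.0 - lam) * a + lam * transport(a)

# Toy source/target activations with different spreads (sigma_a != sigma_b).
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=5000)
b = rng.normal(3.0, 2.0, size=5000)
m_a, s_a, m_b, s_b = a.mean(), a.std(), b.mean(), b.std()

shifted = mean_shift_map(a, m_a, m_b)       # mean matches b, spread stays at s_a
scaled = affine_map(a, m_a, s_a, m_b, s_b)  # mean and spread both match b
half = steer(a, lambda x: affine_map(x, m_a, s_a, m_b, s_b), lam=0.5)
```

When `s_a` equals `s_b` the two maps coincide, which matches the left panel described in Figure 2; when they differ, only the affine map reproduces the target spread.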

Theorems & Definitions (3)

  • Proposition 3.1: Univariate Transport Maps
  • Definition 3.1: Linear-AcT
  • Definition 3.2: Affine Causal Transport Map
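
The definitions themselves are not reproduced in this listing. As a rough illustration of what a per-activation linear transport map could look like, the sketch below fits one slope and one offset per unit by least squares on sorted source/target sample pairs. This estimator, the function name `fit_linear_maps`, and the requirement of equal sample counts are assumptions made for the example and may differ from the paper's Definition 3.1.

```python
import numpy as np

def fit_linear_maps(src_acts, tgt_acts):
    """Fit T(a) = omega * a + beta independently for each unit (column).

    src_acts, tgt_acts: arrays of shape (n_samples, n_units) with equal shapes.
    Hypothetical helper; a least-squares fit on sorted sample pairs.
    """
    a = np.sort(np.asarray(src_acts, dtype=float), axis=0)
    b = np.sort(np.asarray(tgt_acts, dtype=float), axis=0)
    if a.shape != b.shape:
        raise ValueError("source and target must have the same shape")
    m_a, m_b = a.mean(axis=0), b.mean(axis=0)
    a_c, b_c = a - m_a, b - m_b
    # Closed-form 1D least squares per unit: slope from centered sorted pairs.
    omega = (a_c * b_c).sum(axis=0) / (a_c * a_c).sum(axis=0)
    beta = m_b - omega * m_a
    return omega, beta

# Toy check: targets are an exact increasing affine function of the sources.
rng = np.random.default_rng(0)
src = rng.normal(size=(256, 2))
tgt = src * np.array([3.0, 0.5]) + np.array([1.0, -2.0])
omega, beta = fit_linear_maps(src, tgt)
```

Sorting each column pairs the i-th smallest source value with the i-th smallest target value, so the fit approximates a monotone map between the two sample sets rather than a regression on arbitrary pairings.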