Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński, Tomasz Trzciński, Franziska Boenisch, Adam Dziedzic

TL;DR

Conditioned Activation Transport (CAT) is a framework that combines a geometry-based conditioning mechanism with nonlinear transport maps. The transport maps are conditioned to activate only within unsafe activation regions, which minimizes interference with benign queries.

Abstract

Despite their impressive capabilities, current Text-to-Image (T2I) models remain prone to generating unsafe and toxic content. While activation steering offers a promising inference-time intervention, we observe that linear activation steering frequently degrades image quality when applied to benign prompts. To address this trade-off, we first construct SafeSteerDataset, a contrastive dataset containing 2300 safe and unsafe prompt pairs with high cosine similarity. Leveraging this data, we propose Conditioned Activation Transport (CAT), a framework that employs a geometry-based conditioning mechanism and nonlinear transport maps. By conditioning transport maps to activate only within unsafe activation regions, we minimize interference with benign queries. We validate our approach on two state-of-the-art architectures: Z-Image and Infinity. Experiments demonstrate that CAT generalizes effectively across these backbones, significantly reducing Attack Success Rate while maintaining image fidelity compared to unsteered generations. Warning: This paper contains potentially offensive text and images.
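The conditioning idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the distance-to-centroid gate, the `radius` threshold, and the `toy_transport` map are all assumptions chosen for illustration, standing in for the geometry-based conditioning mechanism and the learned nonlinear transport map.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def centroid(points: Sequence[Sequence[float]]) -> Vector:
    """Coordinate-wise mean of a set of activation vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two activation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def conditioned_steer(
    h: Sequence[float],
    unsafe_center: Sequence[float],
    radius: float,
    transport: Callable[[Sequence[float]], Sequence[float]],
) -> Vector:
    """Apply `transport` only when `h` lies inside the assumed unsafe region.

    The region is modeled here as a ball around the unsafe centroid
    (an illustrative assumption, not the paper's conditioning rule).
    """
    if distance(h, unsafe_center) <= radius:
        return list(transport(h))
    return list(h)  # benign activations pass through unchanged


# Toy 2-D "activations" (hypothetical values).
unsafe_acts = [[2.0, 2.0], [2.2, 1.8], [1.8, 2.2]]
safe_acts = [[-2.0, -2.0], [-2.2, -1.8], [-1.8, -2.2]]
unsafe_center = centroid(unsafe_acts)
safe_center = centroid(safe_acts)


def toy_transport(h: Sequence[float]) -> Vector:
    """Stand-in nonlinear map: squash the offset from the unsafe centroid
    with tanh and re-center it on the safe centroid."""
    return [s + math.tanh(x - u) for x, u, s in zip(h, unsafe_center, safe_center)]


steered_unsafe = conditioned_steer([2.1, 1.9], unsafe_center, 1.0, toy_transport)
untouched_benign = conditioned_steer([-1.9, -2.1], unsafe_center, 1.0, toy_transport)
```

In this sketch the benign activation falls outside the gate and is returned unchanged, while the unsafe activation is moved close to the safe centroid. That is the behavior the abstract attributes to conditioning, shown here on toy data only.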

Paper Structure

This paper contains 34 sections, 6 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: ActAdd [rimsky2023steering] and Linear-ACT steering [rodriguez2025controlling] either fail to remove harmful content or alter the semantic content of images. CAT suppresses unsafe content without compromising an image's quality or semantics.
  • Figure 2: Comparison of Transport Maps on Synthetic Manifolds. We evaluate ActAdd, Linear-ACT, and our MLP Transport against the Safe Target (Green). (1) Simple Gaussian: All methods successfully align with the target. (2) Variance Mismatch: ActAdd fails to rotate the distribution while Linear-ACT compresses the variance into a thin line. MLP matches the target geometry. (3) The Moon: Linear-ACT shrinks the crescent into the target distribution range but fails to unbend the topology. MLP morphs the shape correctly. (4) Multi-Modal XOR: Global linear methods scatter clusters due to conflicting directions. MLP Transport correctly maps each cluster to its local target.
  • Figure 3: ActAdd fails to remove key unsafe concepts, while CAT suppresses them without degrading image quality.
  • Figure 4: Comparison of no steering and CAT on the Infinity model. CAT steering precisely eliminates unsafe visual elements (blood, violence, robbery, suicide) while preserving the surrounding scene and background context.
  • Figure 5: Comparison of no steering and CAT on the Z-Image model. CAT steering precisely eliminates unsafe visual elements (nudity, blood, gore) while preserving the surrounding scene and background context.
  • ...and 2 more figures
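
The difference between the global baselines in Figure 2 can be made concrete with a one-dimensional toy. The sketch below assumes that ActAdd amounts to adding a fixed mean-difference offset and that Linear-ACT amounts to a per-dimension affine map matching mean and standard deviation. Both are simplifications for illustration, and the sample values are hypothetical.

```python
import statistics
from typing import List, Tuple


def fit_mean_shift(src: List[float], tgt: List[float]) -> float:
    """ActAdd-style offset (assumed form): difference of the two means."""
    return statistics.fmean(tgt) - statistics.fmean(src)


def fit_affine(src: List[float], tgt: List[float]) -> Tuple[float, float]:
    """Affine map x -> scale * x + bias matching the target mean and std
    (assumed stand-in for a per-dimension linear transport)."""
    scale = statistics.pstdev(tgt) / statistics.pstdev(src)
    bias = statistics.fmean(tgt) - scale * statistics.fmean(src)
    return scale, bias


# Hypothetical 1-D activations: the unsafe set is wider than the safe set.
unsafe = [4.0, 5.0, 6.0, 7.0, 8.0]
safe = [-1.0, -0.5, 0.0, 0.5, 1.0]

shift = fit_mean_shift(unsafe, safe)
shifted = [x + shift for x in unsafe]  # mean matches, spread does not

scale, bias = fit_affine(unsafe, safe)
affine = [scale * x + bias for x in unsafe]  # mean and spread both match

print("mean shift :", statistics.fmean(shifted), statistics.pstdev(shifted))
print("affine map :", statistics.fmean(affine), statistics.pstdev(affine))
print("safe target:", statistics.fmean(safe), statistics.pstdev(safe))
```

A pure offset leaves the spread of the source untouched, and a single affine map can only rescale and shift. Neither can bend a crescent or route separate clusters to different targets, which is the gap the caption attributes to the nonlinear MLP transport.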