
Concept Steerers: Leveraging K-Sparse Autoencoders for Test-Time Controllable Generations

Dahye Kim, Deepti Ghadiyaram

TL;DR

This work introduces Concept Steerers, a test-time, model-agnostic framework using k-Sparse Autoencoders to identify and manipulate monosemantic concepts in text embeddings for diffusion-based image generation. By training a k-SAE once on prompt embeddings, the approach derives sparse latent directions that can precisely steer concepts like nudity, violence, styles, and object attributes during inference without retraining the base model. Empirical results show strong improvements in unsafe content removal (notably up to 20.01% robustness gains against adversarial prompts), preservation of image quality, and about 5x faster inference compared to prior methods. The method demonstrates versatility across SD 1.4, SDXL-Turbo, and FLUX, and maintains prompt-image alignment while enabling fine-grained, test-time control.

Abstract

Despite remarkable progress, text-to-image generative models remain prone to adversarial attacks and can inadvertently generate unsafe, unethical content. Existing approaches often rely on fine-tuning models to remove specific concepts, which is computationally expensive, lacks scalability, and/or compromises generation quality. In this work, we propose a novel framework leveraging k-sparse autoencoders (k-SAEs) to enable efficient and interpretable concept manipulation in diffusion models. Specifically, we first identify interpretable monosemantic concepts in the latent space of text embeddings and leverage them to precisely steer generation away from or towards a given concept (e.g., nudity) or to introduce a new concept (e.g., photographic style) -- all during test time. Through extensive experiments, we demonstrate that our approach is very simple, requires neither retraining of the base model nor LoRA adapters, does not compromise generation quality, and is robust to adversarial prompt manipulations. Our method yields an improvement of $\mathbf{20.01\%}$ in unsafe concept removal, is effective in style manipulation, and is $\mathbf{\sim5\times}$ faster than the current state-of-the-art. Code is available at: https://github.com/kim-dahye/steerers
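
To make the k-sparse autoencoder idea concrete, the following is a minimal NumPy sketch of a top-k forward pass over text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `topk_sae_forward`, the weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the pre-encoder bias subtraction, the ReLU on the kept activations, and all dimensions are hypothetical choices made for this example.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep only the k largest pre-activations per row, then decode.

    All names and the exact layer layout are assumptions for illustration.
    """
    pre = (x - b_dec) @ W_enc + b_enc                 # (n, hidden)
    # Indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    z = np.zeros_like(pre)
    np.put_along_axis(z, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    z = np.maximum(z, 0.0)                            # ReLU on the kept units
    x_hat = z @ W_dec + b_dec                         # reconstruction
    return z, x_hat


# Toy usage with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
d, hidden, k = 16, 64, 4
x = rng.normal(size=(3, d))                           # stand-in for token embeddings
W_enc = rng.normal(size=(d, hidden)) * 0.1
W_dec = rng.normal(size=(hidden, d)) * 0.1
b_enc = np.zeros(hidden)
b_dec = np.zeros(d)
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `z` has at most `k` non-zero entries, which is the sparsity constraint that encourages individual latent units to align with single (monosemantic) concepts.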

Paper Structure

This paper contains 23 sections, 5 equations, 18 figures, 6 tables.

Figures (18)

  • Figure 1: Monosemantic interpretable concepts such as nudity, photographic styles, and object attributes are identified using k-sparse autoencoders (k-SAE). We leverage them to enable precise modification of a desired concept during the generation process, without impacting the overall image structure, photo-realism, visual quality, and prompt alignment (for safe concepts). Our framework can be used to remove unsafe concepts (left), photographic styles (middle), and object attributes (right).
  • Figure 2: K-sparse autoencoder (k-SAE) is trained on feature representations from the text encoder of the diffusion model. Once trained, it serves as a Concept Steerer, enabling precise concept manipulation at test-time. $\lambda$ denotes the strength of the concept.
  • Figure 3: (a) Qualitative comparison of different approaches, including SAFREE (Yoon et al., 2024) and TraSCE (Jain et al., 2024), on the I2P dataset. Our method removes nudity without significantly altering the generated images, resulting in outputs better aligned with the input prompt. (b) Qualitative examples from the I2P dataset. Our method allows fine-grained control over the removal of specific concepts, removing only the intended concept while preserving the overall structure and style of the generated images. (c) Qualitative examples from the Ring-A-Bell dataset. Our method successfully removes the abstract concept of violence, as shown by the absence of blood in the right images. The images are intentionally blurred for display purposes as they are disturbing.
  • Figure 4: Qualitative examples from the I2P dataset with FLUX. Our method is model-agnostic and can be applied to both U-Net-based SD 1.4 and SDXL-Turbo, as well as DiT-based FLUX.
  • Figure 5: Photographic style manipulation of SD 1.4 for the given prompt "geodesic landscape, john chamberlain, christopher balaskas, tadao ando, 4 k," where concept prompts are "minimalist" (Left) and "zoom-in, magnify" (Right), respectively. On the left, the image is manipulated towards a maximalist style as $\lambda \rightarrow -1$, while it adopts a minimalist style as $\lambda \rightarrow 1$. Similarly, on the right, the image appears zoomed out and becomes blurred as $\lambda \rightarrow -1$, whereas it becomes zoomed in and clearer as $\lambda \rightarrow 1$.
  • ...and 13 more figures
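
The captions for Figures 2 and 5 describe a strength parameter $\lambda$ that moves the generation away from a concept as $\lambda \rightarrow -1$ and towards it as $\lambda \rightarrow 1$. The sketch below shows one plausible way such a control could act on prompt embeddings. The exact steering rule is not stated in this summary, so the additive form, the function name `steer_embedding`, and every variable here are assumptions for illustration only.

```python
import numpy as np


def steer_embedding(prompt_emb, concept_latent, W_dec, lam):
    """Shift each token embedding along a decoded concept direction.

    Assumed rule (not confirmed by the source): emb + lam * (latent @ W_dec).
    """
    direction = concept_latent @ W_dec        # (d,) concept direction in embedding space
    return prompt_emb + lam * direction       # broadcast over all tokens


# Toy usage with random values (hypothetical sizes).
rng = np.random.default_rng(0)
d, hidden = 16, 64
prompt_emb = rng.normal(size=(5, d))          # stand-in for 5 token embeddings
W_dec = rng.normal(size=(hidden, d)) * 0.1    # stand-in for a trained k-SAE decoder
concept_latent = np.zeros(hidden)
concept_latent[[3, 17]] = 1.0                 # pretend these units encode the concept

# Sweep the strength, as in the lambda sweeps described for Figure 5.
steered = {lam: steer_embedding(prompt_emb, concept_latent, W_dec, lam)
           for lam in (-1.0, 0.0, 1.0)}
```

With this assumed rule, $\lambda = 0$ leaves the embeddings unchanged, and opposite signs of $\lambda$ shift them in opposite directions along the same concept vector. The steered embeddings would then be passed to the unmodified diffusion model in place of the original ones.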