
Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models

Yaoteng Tan, Zikui Cai, M. Salman Asif

Abstract

Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
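The abstract's core idea — injecting gradient feedback from a frozen semantic scorer through clean latent estimates at each sampling step — can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the generator, the concept embedding, and the quadratic "energy" below are toy stand-ins (a real system would use CLIP/VLM similarity to a blacklisted concept text), but the structure of the update is the same: form the clean estimate from the noisy latent, evaluate the concept energy on it, and nudge the latent down the energy gradient.

```python
import numpy as np

def clean_estimate(x_t, eps_pred, alpha_bar_t):
    """DDPM-style clean-sample estimate x0|t from the noisy latent x_t."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

def concept_energy(x0_hat, concept_vec):
    """Toy semantic energy: squared projection onto a concept direction.
    Stand-in for a frozen foundation model's similarity score."""
    proj = float(x0_hat.flatten() @ concept_vec)
    return proj ** 2

def energy_grad(x0_hat, concept_vec):
    """Analytic gradient of the toy energy w.r.t. the clean estimate."""
    proj = float(x0_hat.flatten() @ concept_vec)
    return (2.0 * proj * concept_vec).reshape(x0_hat.shape)

def steered_step(x_t, eps_pred, alpha_bar_t, concept_vec, guidance_scale=0.05):
    """One steering update: move x_t to reduce the concept energy of x0|t.
    The generator itself is untouched; only the latent is adjusted."""
    x0_hat = clean_estimate(x_t, eps_pred, alpha_bar_t)
    # Chain rule: d(x0_hat)/d(x_t) = 1/sqrt(alpha_bar_t) elementwise.
    grad_xt = energy_grad(x0_hat, concept_vec) / np.sqrt(alpha_bar_t)
    return x_t - guidance_scale * grad_xt

rng = np.random.default_rng(0)
x_t = rng.normal(size=(4,))
concept = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical blacklisted direction
eps = np.zeros(4)                          # stand-in noise prediction
before = concept_energy(clean_estimate(x_t, eps, 0.5), concept)
x_t = steered_step(x_t, eps, 0.5, concept)
after = concept_energy(clean_estimate(x_t, eps, 0.5), concept)
assert after < before  # the update moves the clean estimate off the concept
```

Because the energy is evaluated on the clean estimate rather than the noisy latent, the scorer never needs to be trained on noisy inputs — which is what lets frozen foundation models serve as off-the-shelf supervisors.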

Paper Structure

This paper contains 23 sections, 11 equations, 9 figures, 5 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Overview of the proposed steering framework. We use off-the-shelf foundation models such as CLIP or a VLM as plug-in semantic energy estimators at inference time to steer generation away from unwanted content. Our method is training-free (it performs no model weight updates) and dataset-free (it requires no domain-specific image datasets beyond a user-provided blacklist of target concepts in textual form).
  • Figure 2: Visual examples of $\hat{\mathbf{x}}_{0|t}$ across different generation steps for identity-related prompts. With a fixed budget of $50$ inference steps, the identity is already visually recognizable at step $10$ (i.e., $t = 0.8T$): semantic attributes emerge in the early stages of generation, which motivates us to use them as a semantic energy for manipulating generation before it collapses into high-fidelity rendering.
  • Figure 3: NSFW classification probability in $\hat{\mathbf{x}}_{0|t}$ across generation steps $t$. The red and green curves represent classifier scores for unsafe and safe prompts, each averaged over a dataset of 100 prompts. High probabilities at early stages suggest that pretrained CLIP and VLM models provide effective energy signals for steering away from targeted concepts (e.g., nudity).
  • Figure 4: Quantitative evaluation of steering a single ID. The heatmap colors represent average classification probability scores, where darker regions indicate lower ID-verifier confidence in the targeted identity and lighter regions indicate higher confidence. The high contrast between diagonal and off-diagonal elements indicates that our steering framework achieves precise suppression of the targeted ID while preserving non-targeted IDs.
  • Figure 5: Quantitative evaluation of jointly steering multiple IDs. The heatmap colors represent average classification probability scores, where darker regions indicate lower confidence and lighter regions indicate higher confidence in the targeted IDs. As shown for joint steering of 2 to 9 IDs, our method effectively suppresses multiple targeted IDs simultaneously while ensuring that untargeted IDs are generated without degradation.
  • ...and 4 more figures
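The per-step unsafe probability described for Figure 3 can be sketched with a CLIP-style probe: embed the intermediate clean estimate, compare it against "unsafe" and "safe" text embeddings by cosine similarity, and take a softmax over the two scores. The embeddings below are random stand-ins rather than real CLIP outputs, and the temperature value is an illustrative assumption, but the scoring recipe is the standard image-text similarity pattern.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def unsafe_probability(image_emb, unsafe_emb, safe_emb, temperature=0.05):
    """Softmax over cosine similarities to unsafe vs. safe text embeddings.
    Returns the probability mass assigned to the unsafe concept."""
    logits = np.array([cosine(image_emb, unsafe_emb),
                       cosine(image_emb, safe_emb)]) / temperature
    logits -= logits.max()  # numerical stability before exponentiation
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[0]

rng = np.random.default_rng(1)
unsafe_emb = rng.normal(size=(8,))  # stand-in text embedding of a blacklisted concept
safe_emb = rng.normal(size=(8,))    # stand-in text embedding of a safe description
# A clean-estimate embedding aligned with the unsafe direction scores high...
p_high = unsafe_probability(unsafe_emb + 0.1 * rng.normal(size=8), unsafe_emb, safe_emb)
# ...and one aligned with the safe direction scores low.
p_low = unsafe_probability(safe_emb + 0.1 * rng.normal(size=8), unsafe_emb, safe_emb)
assert p_high > 0.5 > p_low
```

Tracking this probability over sampling steps, as Figure 3 does, is what reveals that unsafe semantics are detectable early enough for steering to intervene before high-fidelity rendering.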