SafeEditor: Unified MLLM for Efficient Post-hoc T2I Safety Editing

Ruiyang Zhang, Jiahao Luo, Xiaoru Feng, Qiufan Pang, Yaodong Yang, Juntao Dai

TL;DR

This work addresses safety in text-to-image generation by moving beyond filtering at the input or output stage to a post-hoc safety editing paradigm. It introduces MR-SafeEdit, a large, multi-round image-text interleaved dataset, and SafeEditor, a unified multimodal LLM trained to iteratively edit unsafe generations while preserving user intent. The approach reduces over-refusal and achieves a favorable safety-utility balance across multiple datasets and generation models. It is model-agnostic and works as a plug-in at the output stage. The contributions include the dataset, the SafeEditor model, comprehensive experiments and ablations, and a discussion of limitations and directions for future work in safety alignment for multi-modal generation.

Abstract

With the rapid advancement of text-to-image (T2I) models, ensuring their safety has become increasingly critical. Existing safety approaches can be categorized into training-time and inference-time methods. While inference-time methods are widely adopted due to their cost-effectiveness, they often suffer from limitations such as over-refusal and imbalance between safety and utility. To address these challenges, we propose a multi-round safety editing framework that functions as a model-agnostic, plug-and-play module, enabling efficient safety alignment for any text-to-image model. Central to this framework is MR-SafeEdit, a multi-round image-text interleaved dataset specifically constructed for safety editing in text-to-image generation. We introduce a post-hoc safety editing paradigm that mirrors the human cognitive process of identifying and refining unsafe content. To instantiate this paradigm, we develop SafeEditor, a unified MLLM capable of multi-round safety editing on generated images. Experimental results show that SafeEditor surpasses prior safety approaches by reducing over-refusal while achieving a more favorable safety-utility balance.
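
The post-hoc, multi-round idea described in the abstract can be sketched as a simple control loop. This is a minimal illustration, not the authors' implementation: the function `multi_round_safety_edit`, the `judge` and `edit` callables, the `max_rounds` parameter, and the tag-tuple "image" are all hypothetical stand-ins. In the paper a single unified MLLM plays both roles; they are split here only to keep the sketch readable.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = Tuple[str, ...]  # toy stand-in: an "image" is a tuple of content tags


@dataclass
class EditStep:
    round_index: int
    instruction: str


def multi_round_safety_edit(
    prompt: str,
    image: Image,
    judge: Callable[[str, Image], Tuple[bool, str]],
    edit: Callable[[Image, str], Image],
    max_rounds: int = 3,  # hypothetical cap, not taken from the paper
) -> Tuple[Image, List[EditStep]]:
    """Inspect an already generated image and edit it until judged safe."""
    history: List[EditStep] = []
    for round_index in range(1, max_rounds + 1):
        is_safe, instruction = judge(prompt, image)
        if is_safe:
            break  # safe images pass through untouched, so nothing is refused
        image = edit(image, instruction)
        history.append(EditStep(round_index, instruction))
    return image, history


# Toy judge/editor pair, purely for demonstration.
UNSAFE_TAGS = ("gore", "weapon")


def toy_judge(prompt: str, image: Image) -> Tuple[bool, str]:
    for tag in UNSAFE_TAGS:
        if tag in image:
            return False, "remove " + tag
    return True, ""


def toy_edit(image: Image, instruction: str) -> Image:
    target = instruction.split(" ", 1)[1]
    return tuple(tag for tag in image if tag != target)


edited, steps = multi_round_safety_edit(
    "a battle scene", ("castle", "gore", "weapon"), toy_judge, toy_edit
)
untouched, no_steps = multi_round_safety_edit(
    "a quiet castle", ("castle",), toy_judge, toy_edit
)
```

Because the loop operates only on the generated output, it does not depend on which text-to-image model produced the image, which is the sense in which the abstract calls the framework model-agnostic and plug-and-play.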

Paper Structure

This paper contains 51 sections, 23 figures, 3 tables.

Figures (23)

  • Figure 1: (a) Filter-based methods can reject at both the input and output stages, which significantly increases over-refusal. Prompt-editing methods also degrade instruction following. SafeEditor ensures minimal changes at the output side and guarantees safety. (b) Humans perceive unsafe content in a post-hoc way and suggest modifications to the image.
  • Figure 2: The data synthesis pipeline of MR-SafeEdit
  • Figure 3: Statistics of the MR-SafeEdit dataset
  • Figure 4: The multi-round inference procedure of SafeEditor
  • Figure 5: (a) Unsafety ratio ($\downarrow$ lower is safer). SafeEditor consistently improves safety across different models. (b) CLIP score ($\uparrow$ higher is better). SafeEditor preserves image–text alignment (utility) across models and datasets.
  • ...and 18 more figures