Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation

Yongwoo Kim; Sungmin Cha; Hyunsoo Kim; Jaewon Lee; Donghyun Kim

TL;DR

The paper addresses the problem of erasing undesired concepts from text-to-image diffusion models while preserving the semantic structure of the generated images. It introduces PAIRed Erasing (PAIR), which reframes erasure as consistency-preserving semantic realignment using multimodal unsafe-safe pairs, replacing naive null-space negation with anchored safe counterparts. The approach has two components: the Paired Semantic Realignment Loss, which explicitly maps unsafe concepts to safe anchors via paired data and visual conditioning, and FiDoRA, a Fisher-information-guided initialization for DoRA that constrains directional weight updates to maintain overall semantic integrity. On the Nudity Removal, Artistic Style Removal, and Object Removal tasks, PAIR outperforms baselines in erasure efficacy, generation quality, and consistency, according to both quantitative and human evaluation, which supports its potential for safer deployment of diffusion models.

Abstract

With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
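The two ingredients named in component (2) can be illustrated with a minimal NumPy sketch. It assumes the standard DoRA formulation (a weight matrix split into a column-wise magnitude and a unit-norm direction, with a low-rank update applied to the direction) and the usual empirical diagonal Fisher estimate (mean of squared per-sample gradients). The function names `dora_decompose`, `dora_merge`, and `diagonal_fisher` are hypothetical, and how FiDoRA actually combines Fisher information with the DoRA initialization is not specified in this summary, so this is not the authors' implementation.

```python
import numpy as np

def dora_decompose(W):
    """Split W (out_dim, in_dim) into column-wise magnitude m and unit-norm direction V."""
    m = np.linalg.norm(W, axis=0, keepdims=True)  # shape (1, in_dim)
    V = W / m                                     # each column has unit L2 norm
    return m, V

def dora_merge(m, W0, B, A):
    """Recompose a DoRA weight: magnitude times the normalized, low-rank-updated direction."""
    V = W0 + B @ A                                # low-rank update of the direction
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)

def diagonal_fisher(grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients (axis 0 = samples)."""
    return np.mean(np.square(grads), axis=0)
```

With `B` initialized to zeros, `dora_merge` returns the pretrained weight unchanged, so training starts from the original model; a Fisher-weighted scheme would presumably choose the initial factors differently, guided by scores like those from `diagonal_fisher` computed on the unsafe-safe pairs.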

Paper Structure (22 sections, 6 equations, 12 figures, 8 tables, 2 algorithms)

Introduction
Related Work
Preliminaries
Methodology
Constructing Paired Datasets for Targeted Erasing
Paired Semantic Realignment Loss
FiDoRA: Fisher-Weighted Initialization for DoRA for Consistency-Preserving Concept Erasure
Motivation: Directional Sensitivity in Concept Erasure
Experiments
Overall Performance
Ablation Study
Conclusion
Impact Statements
Experimental Settings
Training Details
...and 7 more sections

Figures (12)

Figure 1: While existing concept erasure methods either incompletely remove target concepts or change semantic content, our approach achieves surgical erasure by isolating and replacing only the target attributes, maintaining structural consistency and fine-grained details.
Figure 2: Overview of the proposed PAIRed Erasing (PAIR) pipeline. (a) Construction of unsafe–safe pairs. Unsafe images are first generated using forget (target) prompts and filtered by a classifier, then edited to obtain semantically aligned safe counterparts, forming paired forget data $D_f$ and retain data $D_r$. (b) Given paired conditions, the T2I model is optimized to realign unsafe generations toward their safe counterparts using paired images and captions, while preserving consistency. To preserve consistency during fine-tuning, we adopt Fisher-weighted Initialization for DoRA (FiDoRA), which enables consistency-preserving weight updates by decomposing weight directions and magnitudes (c), and initializing the decomposed parameters using unsafe–safe pairs (d).
Figure 3: Directional sensitivity with respect to consistency. The red line represents directional changes ($\Delta$).
Figure 4: Qualitative comparison of baselines on nudity removal.
Figure 5: Win rate (%) comparison across human evaluation (left) and MLLM-based judgment (right). Our method significantly outperforms baselines.
...and 7 more figures
