Consistency-Preserving Concept Erasure via Unsafe-Safe Pairing and Directional Fisher-weighted Adaptation
Yongwoo Kim, Sungmin Cha, Hyunsoo Kim, Jaewon Lee, Donghyun Kim
TL;DR
The paper addresses erasing undesired concepts from text-to-image diffusion models while preserving the semantic structure of the generated images. It introduces PAIRed Erasing (PAIR), which reframes erasure as consistency-preserving semantic realignment using multimodal unsafe-safe pairs, replacing naive null-space negation with anchored safe counterparts. The approach has two components: the Paired Semantic Realignment Loss, which explicitly maps unsafe concepts to safe anchors via paired data and visual conditioning, and FiDoRA, a Fisher-information-guided initialization for DoRA that constrains directional weight updates to maintain overall semantic integrity. Across nudity removal, artistic style removal, and object removal tasks, the authors report that PAIR outperforms baselines in erasure efficacy, generation quality, and consistency, supported by quantitative and human evaluation, which they present as evidence of its potential for safer deployment of diffusion models.
Abstract
With the increasing versatility of text-to-image diffusion models, the ability to selectively erase undesirable concepts (e.g., harmful content) has become indispensable. However, existing concept erasure approaches primarily focus on removing unsafe concepts without providing guidance toward corresponding safe alternatives, which often leads to failure in preserving the structural and semantic consistency between the original and erased generations. In this paper, we propose a novel framework, PAIRed Erasing (PAIR), which reframes concept erasure from simple removal to consistency-preserving semantic realignment using unsafe-safe pairs. We first generate safe counterparts from unsafe inputs while preserving structural and semantic fidelity, forming paired unsafe-safe multimodal data. Leveraging these pairs, we introduce two key components: (1) Paired Semantic Realignment, a guided objective that uses unsafe-safe pairs to explicitly map target concepts to semantically aligned safe anchors; and (2) Fisher-weighted Initialization for DoRA, which initializes parameter-efficient low-rank adaptation matrices using unsafe-safe pairs, encouraging the generation of safe alternatives while selectively suppressing unsafe concepts. Together, these components enable fine-grained erasure that removes only the targeted concepts while maintaining overall semantic consistency. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving effective concept erasure while preserving structural integrity, semantic coherence, and generation quality.
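
To make the second component more concrete, the following is a minimal NumPy sketch of what a Fisher-weighted initialization of a DoRA layer could look like. It is an illustration under stated assumptions, not the authors' algorithm: the abstract does not say how Fisher information enters the initialization. The sketch assumes the standard DoRA reparameterization (a magnitude vector times a column-normalized direction), an empirical diagonal Fisher estimated from per-sample gradients on the unsafe-safe pairs, a square-root Fisher weighting before a truncated SVD, and a residual base weight so the layer is unchanged at initialization. The function names `diagonal_fisher`, `fisher_weighted_init`, and `dora_weight` are hypothetical, and random arrays stand in for real gradients.

```python
import numpy as np


def diagonal_fisher(grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients."""
    return np.mean(np.stack(grads) ** 2, axis=0)


def fisher_weighted_init(W0, fisher, rank):
    """Assumed scheme: take the top-`rank` singular directions of the
    Fisher-weighted weight matrix as the initial low-rank factors."""
    weighted = np.sqrt(fisher) * W0
    U, S, Vt = np.linalg.svd(weighted, full_matrices=False)
    root = np.sqrt(S[:rank])
    B = U[:, :rank] * root          # shape (d, rank)
    A = root[:, None] * Vt[:rank]   # shape (rank, k)
    return B, A


def dora_weight(W_base, B, A, m):
    """DoRA reparameterization: magnitude m times the column-normalized
    direction W_base + B @ A."""
    V = W_base + B @ A
    return m * V / np.linalg.norm(V, axis=0, keepdims=True)


# Toy usage: random data stands in for gradients computed on unsafe-safe pairs.
rng = np.random.default_rng(0)
d, k, rank = 8, 6, 2
W0 = rng.normal(size=(d, k))
grads = [rng.normal(size=(d, k)) for _ in range(4)]

fisher = diagonal_fisher(grads)
B, A = fisher_weighted_init(W0, fisher, rank)

# Assumption: subtract B @ A from the frozen base so the merged weight
# equals W0 before any training step.
W_base = W0 - B @ A
m = np.linalg.norm(W0, axis=0, keepdims=True)  # initial magnitudes
W_init = dora_weight(W_base, B, A, m)
print(np.allclose(W_init, W0))
```

In this sketch the low-rank factors start aligned with the directions that the Fisher estimate marks as most sensitive on the paired data, while the residual base weight keeps the model's behavior unchanged until training begins. Whether PAIR uses this weighting, a different Fisher estimator, or a different residual scheme is not stated in the abstract.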
