Unified Concept Editing in Diffusion Models
Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau
TL;DR
This work tackles the simultaneous safety concerns of copyright adherence, offensive content, and social biases in text-to-image diffusion models by introducing Unified Concept Editing (UCE). UCE provides a closed-form cross-attention weight update that can apply hundreds of edits in one pass, generalizing prior techniques such as TIME and MEMIT to diffusion models. Edits are categorized as erasing, debiasing, or moderation, each implemented via a unified objective that preserves unedited concepts through targeted preservation terms, with an explicit update formula $W = \big(\textstyle\sum_{c_i\in E} v_i^* c_i^T + \textstyle\sum_{c_j\in P} W^{old} c_j c_j^T\big)\big(\textstyle\sum_{c_i\in E} c_i c_i^T + \textstyle\sum_{c_j\in P} c_j c_j^T\big)^{-1}$. Experiments demonstrate effective artistic style erasure, multi-attribute debiasing (gender and race), and NSFW moderation with reduced interference on non-target concepts, supporting scalable, post-training safety editing for real-world deployment.
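To make the closed-form update concrete, here is a minimal NumPy sketch of the formula for a single projection matrix. It is an illustration under stated assumptions, not the authors' released implementation: the function name `uce_update`, its argument names, and the row-wise array layout are hypothetical, and the sketch assumes that the summed outer-product matrix $\sum_{c_i\in E} c_i c_i^T + \sum_{c_j\in P} c_j c_j^T$ is invertible (the edit and preservation embeddings together span the input space).

```python
import numpy as np


def uce_update(W_old, edit_keys, edit_targets, preserve_keys):
    """Sketch of the closed-form update for one projection matrix.

    Hypothetical layout (an assumption of this sketch):
      W_old         : (d_out, d_in) original projection weights
      edit_keys     : (n_e, d_in)   rows are the edited embeddings c_i
      edit_targets  : (n_e, d_out)  rows are the target outputs v_i^*
      preserve_keys : (n_p, d_in)   rows are the preserved embeddings c_j
    """
    W_old = np.asarray(W_old, dtype=np.float64)
    C_e = np.asarray(edit_keys, dtype=np.float64)
    V_e = np.asarray(edit_targets, dtype=np.float64)
    C_p = np.asarray(preserve_keys, dtype=np.float64)

    # Sum over preserved concepts of c_j c_j^T, shape (d_in, d_in).
    preserve_cov = C_p.T @ C_p

    # First factor: sum_i v_i^* c_i^T + sum_j W_old c_j c_j^T, shape (d_out, d_in).
    lhs = V_e.T @ C_e + W_old @ preserve_cov

    # Second factor before inversion: sum_i c_i c_i^T + sum_j c_j c_j^T.
    rhs = C_e.T @ C_e + preserve_cov

    # W = lhs @ inv(rhs). Because rhs is symmetric, solving rhs @ X = lhs^T
    # and transposing gives the same result without forming the inverse.
    # Assumes rhs is invertible; np.linalg.solve raises LinAlgError otherwise.
    return np.linalg.solve(rhs, lhs.T).T
```

In this sketch, setting every target to the original output, $v_i^* = W^{old} c_i$, returns $W^{old}$ unchanged, which is a quick sanity check that the preservation terms behave as described.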
Abstract
Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. We present a method that tackles all issues with a single approach. Our method, Unified Concept Editing (UCE), edits the model without training using a closed-form solution, and scales seamlessly to concurrent edits on text-conditional diffusion models. We demonstrate scalable simultaneous debiasing, style erasure, and content moderation by editing text-to-image projections, and we present extensive experiments demonstrating improved efficacy and scalability over prior work. Our code is available at https://unified.baulab.info
