Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts
Anh Bui, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung
TL;DR
This work tackles the problem of undesirable concepts learned by text-to-image diffusion models by proposing a reversible hiding mechanism that uses a learnable cross-attention prompt to suppress unwanted content. The Knowledge Hiding and Recovery with Prompt (KPOP) framework alternates between knowledge recovery/transfer and knowledge hiding/removal, formalizing its objectives with $L_1$ erasure and $L_2$ recovery terms and using a secret-key prompt $\boldsymbol{p}_{c_e}$ to enable recovery. Empirical results on Stable Diffusion across object, unethical content, and artistic style erasure tasks show strong erasure with high preservation performance and effective recovery when the secret key is provided, outperforming several baselines in key metrics. The approach offers flexible, access-controlled content moderation and contributes a new perspective on safeguarding T2I models while enabling controlled reintegration of restricted concepts for auditing and security analyses.
Abstract
Diffusion models have demonstrated remarkable capability in generating high-quality visual content from textual descriptions. However, since these models are trained on large-scale internet data, they inevitably learn undesirable concepts, such as sensitive content, copyrighted material, and harmful or unethical elements. While previous works focus on permanently removing such concepts, this approach is often impractical, as it can degrade model performance and lead to irreversible loss of information. In this work, we introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users while allowing controlled recovery when needed. Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module, acting as a secure memory that suppresses the generation of hidden concepts unless a secret key is provided. This enables flexible access control, ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it under restricted conditions. Our method introduces a new paradigm where concept suppression and controlled recovery coexist, which was not feasible in prior works. We validate its effectiveness on the Stable Diffusion model, demonstrating that hiding concepts mitigates the risks of permanent removal while maintaining the model's overall capability.
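To make the idea of a learnable prompt inside cross-attention more concrete, here is a minimal sketch in plain Python. It rests on assumptions the abstract does not confirm: the prompt is treated as extra tokens appended to the text context that the image queries attend over (a prefix-tuning-style injection), attention is single-head with identity projections, and the names `cross_attention` and `attend_with_prompt` are hypothetical. Training of the prompt and the secret-key recovery step are not shown.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head attention with identity projections (an illustrative simplification).

    Each query vector attends over the context tokens, which serve as both
    keys and values, and receives a weighted average of them.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        )
    return outputs


def attend_with_prompt(queries, text_tokens, prompt_tokens):
    """Assumed injection scheme: append prompt tokens to the text context."""
    return cross_attention(queries, text_tokens + prompt_tokens)


if __name__ == "__main__":
    image_queries = [[1.0, 0.0], [0.0, 1.0]]   # stand-ins for latent image features
    text_tokens = [[1.0, 2.0], [3.0, 4.0]]     # stand-ins for text-encoder outputs
    prompt_tokens = [[-5.0, -5.0]]             # would be learned in practice
    print(attend_with_prompt(image_queries, text_tokens, prompt_tokens))
```

In this sketch the base attention computation is untouched and only the context it sees changes, which is one way a model could keep its original knowledge while its outputs are steered by a small set of extra parameters.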
