Hiding and Recovering Knowledge in Text-to-Image Diffusion Models via Learnable Prompts

Anh Bui, Khanh Doan, Trung Le, Paul Montague, Tamas Abraham, Dinh Phung

TL;DR

This work tackles the problem of undesirable concepts learned by text-to-image diffusion models by proposing a reversible hiding mechanism that uses a learnable cross-attention prompt to suppress unwanted content. The Knowledge Hiding and Recovery with Prompt (KPOP) framework alternates between knowledge recovery/transfer and knowledge hiding/removal, formalizing objectives with $L_1$ erasure and $L_2$ recovery terms and a secret-key prompt $\boldsymbol{p}_{c_e}$ to enable recovery. Empirical results on Stable Diffusion across object, unethical content, and artistic style erasure tasks show strong erasure with high preserving performance and effective recovery when the secret key is provided, outperforming several baselines in key metrics. The approach offers flexible, access-controlled content moderation and contributes a new perspective on safeguarding T2I models while enabling controlled reintegration of restricted concepts for auditing and security analyses.

Abstract

Diffusion models have demonstrated remarkable capability in generating high-quality visual content from textual descriptions. However, since these models are trained on large-scale internet data, they inevitably learn undesirable concepts, such as sensitive content, copyrighted material, and harmful or unethical elements. While previous works focus on permanently removing such concepts, this approach is often impractical, as it can degrade model performance and lead to irreversible loss of information. In this work, we introduce a novel concept-hiding approach that makes unwanted concepts inaccessible to public users while allowing controlled recovery when needed. Instead of erasing knowledge from the model entirely, we incorporate a learnable prompt into the cross-attention module, acting as a secure memory that suppresses the generation of hidden concepts unless a secret key is provided. This enables flexible access control -- ensuring that undesirable content cannot be easily generated while preserving the option to reinstate it under restricted conditions. Our method introduces a new paradigm where concept suppression and controlled recovery coexist, which was not feasible in prior works. We validate its effectiveness on the Stable Diffusion model, demonstrating that hiding concepts mitigates the risks of permanent removal while maintaining the model's overall capability.
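
To make the phrase "a learnable prompt in the cross-attention module" more concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes, purely for illustration, that the prompt enters as extra vectors appended to the text-token sequence that image queries attend over; the function names `softmax` and `cross_attention`, the absence of key/value projections, and the fixed (untrained) prompt vector are all hypothetical simplifications. How KPOP actually injects and trains the prompt, and how the secret key is used, is not specified in the text above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, text_tokens, prompt_tokens=()):
    """Toy single-head cross-attention.

    ASSUMPTION: the prompt vectors are simply appended to the text tokens,
    and each token serves as both key and value (no learned projections).
    """
    context = list(text_tokens) + list(prompt_tokens)
    dim = len(context[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between the query and every context token.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in context
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the context tokens.
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return outputs


# One image query attending over one text token, without and with a prompt.
query = [[1.0, 0.0]]
text = [[0.5, 0.5]]
prompt = [[0.0, 1.0]]  # hypothetical stand-in for a learned prompt vector

out_plain = cross_attention(query, text)
out_prompt = cross_attention(query, text, prompt)
```

In this toy setting the appended prompt vector pulls part of the attention mass away from the text token, so `out_prompt` differs from `out_plain`. In a real model the prompt would be optimized rather than fixed by hand.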

Paper Structure (26 sections, 2 equations, 13 figures, 7 tables)

Figures (13)

  • Figure 1: Qualitative results. (1st column) Original SD model. (2nd column) Sanitized by ESD. (3rd column) Sanitized by our method. (4th column) Recovered by our method using the secret key. Several failure cases in ESD demonstrate incomplete erasure, whereas our approach effectively removes the target concepts while enabling precise recovery with the secret key.
  • Figure 2: Attentive attribution maps between the visual and textual concepts in the original SD model and our method.
  • Figure 3: Comparison of the erasing performance on the I2P dataset: the number of exposed body parts counted in all generated images with threshold 0.5, and the ratio of images with any exposed body parts detected by the NudeNet detector.
  • Figure 4: Prompt's learning process, and the cosine similarity between visual and textual features in our method and in ESD, respectively.
  • Figure 5: Quantitative results of artistic style erasure.
  • ...and 8 more figures