Selective Fine-Tuning for Targeted and Robust Concept Unlearning

Mansi, Avinash Kori, Francesca Toni, Soteris Demetriou

TL;DR

This work tackles unsafe content generation in text-to-image diffusion models by introducing TRuST, a dynamic, neuron-level unlearning framework. It identifies target concept neurons via cross-attention saliency and enforces unlearning through two objectives: CIP for hard, sparsity-driven removal and CSR for soft, sensitivity-reducing refinement, both applied within a dynamic mask-guided finetuning loop. TRuST is robust against adversarial prompts, preserves the quality of non-targeted concepts, and erases concept combinations and conditional associations with substantially improved efficiency. The approach is architecture-agnostic and adaptable to broader generative models, offering a practical path toward safer, controllable image synthesis.

Abstract

Text-guided diffusion models are used by millions of users, but can be easily exploited to produce harmful content. Concept unlearning methods aim to reduce a model's likelihood of generating such content. Traditionally, this has been tackled at the level of individual concepts, with only a handful of recent works considering more realistic concept combinations. However, state-of-the-art methods depend on full finetuning, which is computationally expensive. Concept localisation methods can facilitate selective finetuning, but existing techniques are static, resulting in suboptimal utility. To tackle these challenges, we propose TRuST (Targeted Robust Selective fine Tuning), a novel approach for dynamically estimating target concept neurons and unlearning them through selective finetuning, supported by a Hessian-based regularization. We show experimentally, against a number of SOTA baselines, that TRuST is robust against adversarial prompts, preserves generation quality to a significant degree, and is significantly faster than the SOTA. Our method unlearns not only individual concepts but also combinations of concepts and conditional concepts, without any specific regularization.
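
To make the contrast between static and dynamic concept localisation concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: the quadratic "forget" loss, the use of gradient magnitude as a saliency score, the top-k mask, and all function names (`grad_forget`, `top_k_mask`, `selective_finetune`) are assumptions made for illustration, and the Hessian-based regularization is omitted entirely. The only idea taken from the abstract is that the set of parameters being finetuned is either fixed once (static) or re-estimated at every step (dynamic).

```python
def grad_forget(theta, target):
    # Gradient of a toy "forget" loss 0.5 * sum((theta_i - target_i)^2).
    # Assumption: a stand-in for a real unlearning objective.
    return [t - g for t, g in zip(theta, target)]


def forget_loss(theta, target):
    return 0.5 * sum((t - g) ** 2 for t, g in zip(theta, target))


def top_k_mask(scores, k):
    # Binary mask: 1 for the k largest scores, 0 elsewhere.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def selective_finetune(theta, target, k, steps, lr, dynamic=True):
    # Gradient descent that only updates parameters selected by the mask.
    # dynamic=True  -> mask is re-estimated from the current gradient each step.
    # dynamic=False -> mask is computed once, before the first update.
    theta = list(theta)
    mask = None
    touched = set()  # indices that were ever selected for an update
    for _ in range(steps):
        grad = grad_forget(theta, target)
        if dynamic or mask is None:
            # Assumption: |gradient| serves as the saliency score.
            mask = top_k_mask([abs(g) for g in grad], k)
        theta = [t - lr * m * g for t, m, g in zip(theta, mask, grad)]
        touched.update(i for i, m in enumerate(mask) if m)
    return theta, touched


if __name__ == "__main__":
    theta0 = [4.0, 3.0, 0.1, 0.0]
    target = [0.0, 0.0, 0.0, 0.0]
    for dynamic in (False, True):
        theta, touched = selective_finetune(theta0, target, k=1, steps=6, lr=0.5, dynamic=dynamic)
        print("dynamic" if dynamic else "static", sorted(touched), forget_loss(theta, target))
```

In this toy setting the static mask keeps updating the single parameter that was most salient at the start, while the dynamic mask moves to whichever parameter is most salient at the current step, so the same per-step budget reaches a lower forget loss. Whether this carries over to a real diffusion model depends on the saliency estimate and objective actually used.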

Paper Structure (40 sections, 12 equations, 20 figures, 9 tables, 2 algorithms)


Figures (20)

  • Figure 1: Challenges in MU for image generation. We compare CoGFD, SalUn, and SalUn++, a stronger version of SalUn in which the saliency map is recomputed after each finetuning step. Stable Diffusion 1.5 is used as the reference for computing $\Delta$CLIP and $\Delta$FID.
  • Figure 2: Overview of TRuST. The pipeline (left) depicts concept-neuron discovery and selective finetuning with both CSR and CIP. The right half showcases TRuST's ability to unlearn both concept combinations and conditional concepts, along with comparisons against well-established concept erasure methods for "Nudity" unlearning under adversarial prompts (P4D, 10.5555/3692070.3692406). Sections of images marked with "*" have been intentionally hidden for safety purposes.
  • Figure 3: Average CLIP scores and number of finetuning steps. TRuST achieves better concept combination erasure (lower scores for targeted combinations) while better preserving individual concepts (higher scores for the individual concepts), in fewer steps on average than the SOTA.
  • Figure 4: TIFA comparison for the conditional prompt “Cat on the table”.
  • Figure 5: Example of the robustness and fidelity of our method in comparison to existing works on the I2P dataset. Please refer to Table \ref{tab:i2p_prompts} for the respective prompts.
  • ...and 15 more figures