Attention Shift: Steering AI Away from Unsafe Content

Shivank Garg, Manyana Tiwari

TL;DR

This paper introduces a novel training-free approach that uses attention reweighing to remove unsafe concepts during inference, with no additional training, and compares it against existing ablation methods.

Abstract

This study investigates the generation of unsafe or harmful content in state-of-the-art generative models, focusing on methods for restricting such generations. We introduce a novel training-free approach that uses attention reweighing to remove unsafe concepts during inference, with no additional training. We compare our method against existing ablation methods, evaluating performance on both direct and adversarial jailbreak prompts using qualitative and quantitative metrics. We hypothesize potential reasons for the observed results and discuss the limitations and broader implications of content restriction.

Paper Structure (20 sections, 5 figures, 3 tables)

Figures (5)

  • Figure 1: For comparison, we generate an image from the original unsafe prompt and use our method to obtain a safe image. We observe a vast difference in the extent to which explicit output is restricted.
  • Figure 2: We first replace the unsafe tokens with the modified safe tokens, turning $\mathbf{M}_t$ into $\mathbf{M}^{*}_t$, then add new attention maps to account for additional words in the new safe prompt. We reweigh these modified maps to emphasize the central safe concept of the image while ensuring efficient image editing. We use the final modified and reweighed cross-attention maps $\hat{\mathbf{M}}^{*}_t$ for the denoising process.
  • Figure 3: Image editing when we give a higher attention score to the attention map corresponding to "women".
  • Figure 4: Image editing when we give a higher attention score to the attention map corresponding to "clothed".
  • Figure 5: Visual ablation results on various state-of-the-art models. Rows represent different types of unsafe content: (1) Violence, (2) Violence (Jailbreak), (3) Nudity, (4) Nudity (Jailbreak). Columns correspond to different ablation techniques: (1) Baseline, (2) Concept Ablation, (3) Forget-Me-Not, (4) Safe Diffusion, (5) Fine-Tuned Diffusion Model, (6) SPM, (7) P2P (Ours), (8) Image produced by the Diffusion Model using a new prompt.
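
The three steps named in the Figure 2 caption (replace maps, add maps for new words, reweigh) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes cross-attention maps are stored as arrays of shape (pixels, tokens); the function name `edit_cross_attention` and the `alignment` and `scales` arguments are hypothetical, introduced only for this example.

```python
import numpy as np


def edit_cross_attention(src_maps, tgt_maps, alignment, scales):
    """Toy cross-attention edit for a single denoising step.

    src_maps:  (pixels, n_src_tokens) maps from the original prompt (M_t).
    tgt_maps:  (pixels, n_tgt_tokens) maps from the safe prompt.
    alignment: one entry per safe-prompt token. An int i means "reuse the
               source map of token i"; None means the token is replaced or
               newly added, so its own map from the safe prompt is kept.
    scales:    dict {safe-prompt token index: scale factor} for reweighing.
    """
    # Start from the safe-prompt maps: replaced and newly added tokens
    # keep their own attention.
    edited = np.array(tgt_maps, dtype=float, copy=True)

    # Unchanged tokens reuse the source maps, which preserves layout (M*_t).
    for j, i in enumerate(alignment):
        if i is not None:
            edited[:, j] = src_maps[:, i]

    # Reweigh selected token maps to emphasize the safe concept.
    # Assumption: no renormalization is applied after scaling.
    for j, c in scales.items():
        edited[:, j] *= c

    return edited
```

For example, with a two-token source prompt and a three-token safe prompt, `alignment=[0, None, None]` keeps the layout of the first token and takes the other two maps from the safe prompt, while `scales={2: 2.0}` doubles the attention paid to the last token.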