You Don't Need All That Attention: Surgical Memorization Mitigation in Text-to-Image Diffusion Models

Kairan Zhao, Eleni Triantafillou, Peter Triantafillou

TL;DR

GUARD is a surgical, dynamic, per-prompt, inference-time approach to memorization mitigation. It is by far the most robust method evaluated: it consistently produces state-of-the-art mitigation results across two architectures and for both verbatim and template memorization, while matching or improving on image quality.

Abstract

Generative models have been shown to "memorize" certain training data, leading them to generate verbatim or near-verbatim copies of training images, which may raise privacy concerns or constitute copyright infringement. We introduce Guidance Using Attractive-Repulsive Dynamics (GUARD), a novel framework for memorization mitigation in text-to-image diffusion models. GUARD adjusts the image denoising process to guide generation away from an original training image and towards one that is distinct from the training data while remaining aligned with the prompt, guarding against reproducing training data without hurting image generation quality. We propose a concrete instantiation of this framework in which the positive target that we steer towards is given by a novel method for cross-attention attenuation. This method consists of (i) a novel statistical mechanism that automatically identifies the prompt positions where cross-attention must be attenuated and (ii) attenuating cross-attention at these per-prompt positions. The resulting GUARD offers a surgical, dynamic, per-prompt, inference-time approach that, we find, is by far the most robust method: it consistently produces state-of-the-art results for memorization mitigation across two architectures and for both verbatim and template memorization, while also improving upon or yielding comparable results in terms of image quality.
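To make the two ingredients named in the abstract more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's actual equations: the function names, the weights `w_pos` and `w_neg`, the `scale` factor, and the specific combination rule (a classifier-free-guidance-style update with an extra repulsive term) are all hypothetical choices made for this example.

```python
import numpy as np


def attenuate_cross_attention(attn, positions, scale=0.1):
    """Down-weight cross-attention at selected prompt positions.

    attn: array of shape (num_queries, num_tokens) whose rows sum to 1.
    positions: token indices to attenuate (assumed to be given here).
    scale: hypothetical attenuation factor in (0, 1].
    Rows are renormalized so each one still sums to 1.
    """
    attn = np.asarray(attn, dtype=float).copy()
    attn[:, positions] *= scale
    return attn / attn.sum(axis=-1, keepdims=True)


def guided_noise(eps_uncond, eps_pos, eps_neg, w_pos=7.5, w_neg=2.0):
    """Assumed attractive-repulsive combination of noise predictions.

    eps_pos: prediction for the target to steer towards (for example, one
             computed with attenuated cross-attention).
    eps_neg: prediction for the target to steer away from (for example, the
             original prompt-conditioned prediction).
    With w_neg = 0 this reduces to standard classifier-free guidance.
    """
    eps_uncond = np.asarray(eps_uncond, dtype=float)
    attract = np.asarray(eps_pos, dtype=float) - eps_uncond
    repel = np.asarray(eps_neg, dtype=float) - eps_uncond
    return eps_uncond + w_pos * attract - w_neg * repel
```

In this sketch the attractive term pulls the denoising direction towards the positive target and the repulsive term pushes it away from the negative one. How GUARD actually weights and schedules these terms is defined in the paper's method section, not here.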

Paper Structure

This paper contains 23 sections, 8 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Cross-attention (CA) patterns. (a) and (b): the CA mass across tokens in the first and last inference steps, for verbatim and template memorization (VM and TM). For clarity, the plots exclude the first token (position 0): it consistently receives the majority of CA for both memorized and non-memorized examples, so it would dominate the scale and obscure differences among the remaining tokens. (c) and (d): the CA mass on EOT across steps, for a memorized and a non-memorized prompt. (c): VM on SD v1.4. (d): TM on SD v2.0.
  • Figure 2: The best achievable SSCD, CLIP, and FID of different methods. We plot the best value a method can achieve on each metric individually, using the configuration that yields best results on that specific metric. In each subplot, a different configuration may be used (we pick the hyperparameter setting that yields the best SSCD, best CLIP and best FID, respectively), so this plot does not speak to the ability of a method to do well on all metrics jointly, nor to trade-offs between these metrics, which we investigate separately.
  • Figure 3: Comparison of the SSCD of different methods for the hyperparameter configuration that works best for CLIP.
  • Figure 4: Overview of the CA-in-GUARD denoising process.
  • Figure 5: The best achievable SSCD, CLIP, and FID of CA-in-GUARD (our default) versus semantic-in-GUARD (ablation), evaluated across 3 settings: SD v1.4 with verbatim memorization, template memorization, and SD v2.0 with template memorization.
  • ...and 6 more figures
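
The per-token CA mass described in the Figure 1 caption can be illustrated with a short NumPy sketch. The definition of "mass" used here (the mean attention a token receives over all image queries) and the z-score rule for flagging tokens are assumptions made for illustration; the z-score rule is only a stand-in for the paper's statistical mechanism, which is not reproduced here. The function names and the `z_threshold` parameter are hypothetical.

```python
import numpy as np


def token_attention_mass(attn):
    """Mean cross-attention each prompt token receives.

    attn: array of shape (num_queries, num_tokens) whose rows sum to 1.
    Returns an array of shape (num_tokens,).
    """
    return np.asarray(attn, dtype=float).mean(axis=0)


def flag_outlier_tokens(mass, z_threshold=2.0, skip_first=True):
    """Flag token positions with unusually high attention mass.

    skip_first drops position 0 before computing statistics, mirroring the
    caption's note that the first token dominates the scale.
    Returns positions indexed against the original token sequence.
    """
    mass = np.asarray(mass, dtype=float)
    start = 1 if skip_first else 0
    rest = mass[start:]
    std = rest.std()
    if std == 0:
        return []  # every remaining token has identical mass
    z = (rest - rest.mean()) / std
    return [int(i) + start for i in np.flatnonzero(z > z_threshold)]
```

For example, if one late token holds far more mass than the other non-initial tokens, `flag_outlier_tokens` returns its position, and that position could then be passed to an attenuation step.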