Table of Contents
Fetching ...

Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques

Benyuan Meng, Qianqian Xu, Zitai Wang, Zhiyong Yang, Xiaochun Cao, Qingming Huang

TL;DR

The cause of content shift is located as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature, and the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features.

Abstract

Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.

Suppress Content Shift: Better Diffusion Features via Off-the-Shelf Generation Techniques

TL;DR

The cause of content shift is located as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature, and the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features.

Abstract

Diffusion models are powerful generative models, and this capability can also be applied to discrimination. The inner activations of a pre-trained diffusion model can serve as features for discriminative tasks, namely, diffusion feature. We discover that diffusion feature has been hindered by a hidden yet universal phenomenon that we call content shift. To be specific, there are content differences between features and the input image, such as the exact shape of a certain object. We locate the cause of content shift as one inherent characteristic of diffusion models, which suggests the broad existence of this phenomenon in diffusion feature. Further empirical study also indicates that its negative impact is not negligible even when content shift is not visually perceivable. Hence, we propose to suppress content shift to enhance the overall quality of diffusion features. Specifically, content shift is related to the information drift during the process of recovering an image from the noisy input, pointing out the possibility of turning off-the-shelf generation techniques into tools for content shift suppression. We further propose a practical guideline named GATE to efficiently evaluate the potential benefit of a technique and provide an implementation of our methodology. Despite the simplicity, the proposed approach has achieved superior results on various tasks and datasets, validating its potential as a generic booster for diffusion features. Our code is available at https://github.com/Darkbblue/diffusion-content-shift.

Paper Structure

This paper contains 42 sections, 5 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Current diffusion features widely suffer from content shift, i.e., content differences between inputs and features. Due to the inherent connection, content shift can be suppressed with off-the-shelf generation techniques.
  • Figure 2: The overall process of feature extraction. The original image is first processed into UNet inputs via VAE and noise addition. Afterward, we collect the output activations of each resolution in the upsampling stage as convolutional features. At the same time, the cross-attention layers of UNet produce similarity maps, which are averaged over all upsampling layers as attention features.
  • Figure 3: The averaged results over three repeats with quality prompts. Horse-21 (high quality) and CIFAR10 (low quality) benefit from prompts closer to the image quality, suggesting the negative effect of content shift at small timesteps.
  • Figure 4: Content shift is caused by the reconstruction process within diffusion model activations. This visualization consists of activations obtained during a single network forward pass.
  • Figure 5: Overview of the GATE guideline and our implementation. GATE evaluates if a technique can suppress content shift based on the result of Img2Img generation. If a technique makes the result more similar to the input, it is considered to be potentially helpful. We further implement GATE by choosing three off-the-shelf generation techniques and amalgamating features obtained with different combinations of the techniques.
  • ...and 7 more figures