Not All Diffusion Model Activations Have Been Evaluated as Discriminative Features

Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Qingming Huang

TL;DR

We discover three properties universal among diffusion models, which lets this study go beyond specific models, and we present effective feature selection solutions for several popular diffusion models.

Abstract

Diffusion models were initially designed for image generation. Recent research shows that the internal signals within their backbones, named activations, can also serve as dense features for various discriminative tasks such as semantic segmentation. Given the large number of activations, selecting a small yet effective subset is a fundamental problem. To this end, an early study in this field performed a large-scale quantitative comparison of the discriminative ability of the activations. However, we find that many potential activations have not been evaluated, such as the queries and keys used to compute attention scores. Moreover, recent advances in diffusion architectures bring many new activations, such as those within embedded ViT modules. Taken together, activation selection remains unresolved yet overlooked. To tackle this issue, this paper takes a further step by evaluating a much broader range of activations. Given the significant increase in the number of activations, a full-scale quantitative comparison is no longer feasible. Instead, we seek to understand the properties of these activations, so that clearly inferior activations can be filtered out in advance via simple qualitative evaluation. After careful analysis, we discover three properties universal among diffusion models, enabling this study to go beyond specific models. On top of this, we present effective feature selection solutions for several popular diffusion models. Finally, experiments across multiple discriminative tasks validate the superiority of our method over the SOTA competitors. Our code is available at https://github.com/Darkbblue/generic-diffusion-feature.

Paper Structure (38 sections, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Prior art considers only a small fraction of the potential activations in diffusion models. As a result, a more advanced diffusion architecture fails to achieve better performance (SDXL vs. SDv1.5). In contrast, we consider a broader range of candidate activations. To keep the quantitative comparison tractable, we first conduct a comprehensive and generalizable analysis that qualitatively filters out many candidates in advance. On top of this, our method achieves superior performance (75.2 PCK@0.1).
  • Figure 2: U-Net architecture (upper) and the ViT module (lower), taking SDXL as an example.
  • Figure 3: We highlight three properties of diffusion U-Nets that are distinct from existing knowledge about other models: (a) Asymmetric diffusion noises. (b) In-resolution granularity changes. (c) Locality without positional embeddings: pixels within the orange circle resemble nearby background pixels more than distant pixels on the horse's neck that are semantically closer.
  • Figure 4: (a) Diffusion noises result in a significant performance degradation (Resolution#1). (b) Locality degrades the quality of self-attention activations (Block#0 and Block#5). (c) Locality in self-attention activations can suppress diffusion noises, leading to better quality than noisy activations (41.41 vs. 34.58). All $\text{PCK@0.1}_{\text{img}}(\uparrow)$ results are evaluated on the semantic correspondence task.
  • Figure 5: Visualization of SDXL activations on a simple outdoor scene.
  • ...and 13 more figures
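
The captions above report results as PCK@0.1 (percentage of correct keypoints) on semantic correspondence. As a minimal sketch of how such a score is commonly computed, assuming the usual image-level convention in which a predicted keypoint counts as correct when it lies within alpha * max(H, W) of its ground-truth location: the function name `pck_img` and the toy coordinates are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def pck_img(pred, gt, image_size, alpha=0.1):
    """Fraction of predicted keypoints within alpha * max(H, W) of ground truth.

    pred, gt: equal-length lists of (x, y) coordinates in pixels.
    image_size: (height, width) of the target image.
    Assumption: image-level threshold; a bounding-box variant would use
    the object box size in place of the image size.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of keypoints")
    if not gt:
        raise ValueError("at least one keypoint is required")
    height, width = image_size
    threshold = alpha * max(height, width)
    correct = sum(
        1
        for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= threshold
    )
    return correct / len(gt)


# Toy example with made-up coordinates: threshold = 0.1 * 320 = 32 pixels.
pred = [(10.0, 10.0), (50.0, 80.0), (200.0, 120.0)]
gt = [(12.0, 11.0), (90.0, 80.0), (205.0, 118.0)]
score = pck_img(pred, gt, image_size=(240, 320))
print(f"PCK@0.1 = {100 * score:.1f}")
```

In this toy case the second keypoint is 40 pixels off and exceeds the 32-pixel threshold, so two of three keypoints count as correct. Reported numbers such as 75.2 are this fraction expressed as a percentage and aggregated over a benchmark.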