Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin

TL;DR

This work identifies massive activations in Diffusion Transformers (DiTs) as a key barrier to reliable dense visual correspondence. It introduces Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that uses channel-wise modulation and a channel discard strategy to suppress these activations and extract semantically discriminative features, conditioned on timestep $t$ and text input $c$. Empirically, DiTF achieves state-of-the-art performance on semantic correspondence benchmarks (e.g., substantial gains over SD-based and DINOv2 baselines) and demonstrates robust cross-category results and transfer to semantic segmentation. The findings show that leveraging AdaLN within DiTs can unlock strong, semantically meaningful representations for a range of perceptual tasks without additional training.

Abstract

Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).
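
The two steps named in the abstract, channel-wise modulation and channel discard, can be illustrated with a minimal sketch. It assumes the commonly used AdaLN form, (1 + gamma) * LayerNorm(z) + beta, with the scale and shift vectors passed in directly; in a real DiT they would be regressed from the timestep and text conditioning. The function names, the toy values, and the order of the two steps are illustrative assumptions, not the authors' implementation.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def adaln_modulate(tokens, gamma, beta):
    """Apply (1 + gamma) * LayerNorm(z) + beta to every token, channel by channel."""
    out = []
    for token in tokens:
        normed = layer_norm(token)
        out.append([(1.0 + g) * x + b for x, g, b in zip(normed, gamma, beta)])
    return out


def discard_channels(tokens, channels):
    """Drop the listed channel indices from every token."""
    drop = set(channels)
    return [[x for i, x in enumerate(token) if i not in drop] for token in tokens]


# Toy input: 2 patch tokens x 4 channels; channel 2 carries a massive activation.
tokens = [
    [0.5, -0.2, 120.0, 0.3],
    [-0.4, 0.1, 118.0, -0.6],
]
gamma = [0.1, -0.2, 0.0, 0.3]  # hypothetical per-channel scale
beta = [0.0, 0.1, -0.1, 0.0]   # hypothetical per-channel shift

modulated = adaln_modulate(tokens, gamma, beta)
# Assumed here: the massive channel index is already known and is simply dropped.
features = discard_channels(modulated, channels=[2])
```

After the discard step each token keeps three of its four channels, so any downstream similarity computation between patch tokens no longer sees the dominant dimension.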

Paper Structure

This paper contains 31 sections, 9 equations, 20 figures, 9 tables.

Figures (20)

  • Figure 1: AdaLN enhances DiT features by mitigating massive activations. (a) Original DiT features show concentrated massive activations. (b) Semantic correspondence performance using different features. Original DiT features yield poor performance due to the presence of massive activations. By modulating these activations, AdaLN significantly boosts correspondence performance.
  • Figure 2: Massive activations in Diffusion Transformers (DiTs). We visualize the activation magnitudes (z-axis) of features across various diffusion models. Unlike the Stable Diffusion (SD2-1) model, all DiTs exhibit a distinctive phenomenon where very few feature activations show significantly higher activation values, more than 100 times larger than others. We refer to this phenomenon as massive activations, which has also been observed in large language models (LLMs) (Sun et al., 2024).
  • Figure 3: Massive Activations in SD3-5. We visualize the activation magnitudes of four different image features extracted using SD3-5. Notably, massive activations consistently distribute in a fixed dimension (676) across all image patch tokens.
  • Figure 4: Massive activations dimensions align with the residual scaling factor $\alpha_k$. We visualize the magnitudes for the original feature $z_k^{t+1}$ and residual scaling factor $\alpha_k$.
  • Figure 4: Ablation study of our model ${\text{DiTF}_{\texttt{flux}}}$ on the SPair-71k and AP-10K datasets. We report the PCK per point for SPair-71k and PCK per image for AP-10K-I.S.
  • ...and 15 more figures
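
The pattern described in the captions for Figures 2 and 3, a few fixed channels whose magnitudes are more than 100 times larger than the rest across all patch tokens, can be checked with a short sketch. The 100x ratio comes from the Figure 2 caption. Using the median channel magnitude as the baseline, the function name, and the toy values are assumptions made for illustration only.

```python
from statistics import median


def find_massive_channels(tokens, ratio=100.0):
    """Return channel indices whose mean |activation| exceeds ratio x the median channel magnitude."""
    num_channels = len(tokens[0])
    # Average absolute activation of each channel over all patch tokens.
    magnitudes = [
        sum(abs(token[c]) for token in tokens) / len(tokens)
        for c in range(num_channels)
    ]
    baseline = median(magnitudes)
    return [c for c, m in enumerate(magnitudes) if m > ratio * baseline]


# Toy features: 3 patch tokens x 6 channels, with channel 4 massive in every token.
features = [
    [0.5, -0.3, 0.8, -0.6, 250.0, 0.4],
    [-0.7, 0.2, -0.5, 0.9, 240.0, -0.3],
    [0.6, -0.4, 0.7, -0.8, 260.0, 0.5],
]
massive = find_massive_channels(features)
print(massive)  # → [4]
```

Averaging over tokens before thresholding reflects the observation that the massive activations sit in the same dimension for every patch token, rather than appearing at token-specific positions.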