Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations
Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin
TL;DR
This work identifies massive activations in Diffusion Transformers (DiTs) as a key barrier to reliable dense visual correspondence. It introduces Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that suppresses these activations through channel-wise modulation and a channel discard strategy, yielding semantically discriminative features, with AdaLN conditioned on the timestep $t$ and text input $c$. Empirically, DiTF achieves state-of-the-art performance on semantic correspondence benchmarks, outperforming SD-based and DINOv2 baselines (with reported gains of +9.4% on SPair-71k and +4.4% on AP-10K-C.S.), and demonstrates robust cross-category results and transfer to semantic segmentation. The findings indicate that leveraging AdaLN within DiTs can unlock strong, semantically meaningful representations for a range of perceptual tasks without additional training.
Abstract
Pre-trained Stable Diffusion (SD) models have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations take significantly larger values than the others. These massive activations lead to uninformative representations and significant performance degradation for DiTs. They consistently concentrate at very few fixed dimensions across all image patch tokens and hold little local information. We analyze these dimension-concentrated massive activations and uncover that their concentration is inherently linked to the Adaptive Layer Normalization (AdaLN) in DiTs. Building on these findings, we propose the Diffusion Transformer Feature (DiTF), a training-free AdaLN-based framework that extracts semantically discriminative features from DiTs. Specifically, DiTF leverages AdaLN to adaptively localize and normalize massive activations through channel-wise modulation. Furthermore, a channel discard strategy is introduced to mitigate the adverse effects of massive activations. Experimental results demonstrate that DiTF outperforms both DINO and SD-based models and establishes a new state of the art for DiTs in different visual correspondence tasks (e.g., +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).
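The three ingredients named in the abstract (dimension-concentrated massive activations, AdaLN channel-wise modulation, and channel discard) can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the median-based threshold `ratio`, the choice to zero out discarded channels, and the hand-written feature values are all assumptions made for illustration. The modulation follows the commonly used DiT form `LN(x) * (1 + scale) + shift`; in a real model, `scale` and `shift` would be produced by the pretrained network from $t$ and $c$.

```python
import math
from statistics import median


def find_massive_channels(feats, ratio=10.0):
    """Indices of channels whose mean |activation| over all tokens exceeds
    `ratio` times the median channel magnitude (hypothetical criterion)."""
    n_tokens, dim = len(feats), len(feats[0])
    mags = [sum(abs(tok[d]) for tok in feats) / n_tokens for d in range(dim)]
    threshold = ratio * median(mags)
    return [d for d, m in enumerate(mags) if m > threshold]


def adaln_modulate(feats, scale, shift, eps=1e-6):
    """AdaLN-style modulation: per-token layer norm without learned affine,
    followed by a channel-wise LN(x) * (1 + scale) + shift."""
    out = []
    for tok in feats:
        mu = sum(tok) / len(tok)
        var = sum((v - mu) ** 2 for v in tok) / len(tok)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.append([(v - mu) * inv_std * (1.0 + s) + b
                    for v, s, b in zip(tok, scale, shift)])
    return out


def discard_channels(feats, channels):
    """Zero out the given channels in every token (assumed discard rule)."""
    drop = set(channels)
    return [[0.0 if d in drop else v for d, v in enumerate(tok)]
            for tok in feats]


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


# Toy patch-token features (3 tokens x 6 channels). Channel 3 carries a
# large, nearly constant value in every token, mimicking a massive activation.
feats = [
    [0.5, -0.2, 0.1, 50.0, 0.3, -0.4],
    [-0.3, 0.4, -0.1, 52.0, -0.2, 0.5],
    [0.2, 0.1, -0.5, 49.0, 0.4, -0.1],
]

massive = find_massive_channels(feats)
cleaned = discard_channels(feats, massive)
print("massive channels:", massive)
print("cosine(token0, token1) before:", cosine(feats[0], feats[1]))
print("cosine(token0, token1) after: ", cosine(cleaned[0], cleaned[1]))
```

In this toy run, the shared massive channel dominates the similarity between the first two tokens (above 0.99) even though their remaining channels point in nearly opposite directions; after the discard step the similarity becomes negative, which is the sense in which such channels hold little local information. `adaln_modulate` is defined but not applied here, since meaningful `scale` and `shift` values would have to come from a pretrained DiT.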
