
A-SelecT: Automatic Timestep Selection for Diffusion Transformer Representation Learning

Changyu Liu, James Chenhao Liang, Wenhao Yang, Yiming Cui, Jinghao Yang, Tianyang Wang, Qifan Wang, Dongfang Liu, Cheng Han

Abstract

Diffusion models have significantly reshaped the field of generative artificial intelligence and are now increasingly explored for their capacity in discriminative representation learning. The Diffusion Transformer (DiT) has recently gained attention as a promising alternative to conventional U-Net-based diffusion models, opening an avenue toward downstream discriminative tasks via generative pre-training. However, its training efficiency and representational capacity remain largely constrained by inadequate timestep search and insufficient exploitation of DiT-specific feature representations. In light of this, we introduce Automatically Selected Timestep (A-SelecT), which dynamically pinpoints DiT's most information-rich timestep from the selected transformer feature in a single run, eliminating both computationally intensive exhaustive timestep search and suboptimal discriminative feature selection. Extensive experiments on classification and segmentation benchmarks demonstrate that DiT, empowered by A-SelecT, efficiently and effectively surpasses all prior diffusion-based approaches.


Paper Structure

This paper contains 38 sections, 8 equations, 13 figures, 11 tables, 1 algorithm.

Figures (13)

  • Figure 1: A preliminary study on the relationship between the High-Frequency Ratio (HFR) and classification performance on Oxford Flowers (a) and CUB (b). The green curve represents the HFR values, and the red curve represents classification accuracies. We make two key observations: I. HFR values exhibit a positive correlation with classification accuracies. II. The highest classification accuracy is achieved when the HFR value reaches its maximum. More results appear in Appendix § S2.
  • Figure 2: Overview of Automatically Selected Timestep (A-SelecT). A-SelecT begins by simulating $\text{sample}_{t}$ at timestep $t$ (see Eq. \ref{eq:forward}). This $\text{sample}_{t}$ is then processed through the diffusion backbone to extract the query feature $Q_t$ at each timestep $t$. Once all $Q_t$ are obtained, their HFRs are calculated. The timestep with the highest average HFR is then selected for feature extraction, and the query feature extracted at this optimal timestep $\hat{t}$ is fed into the targeted discriminative tasks.
  • Figure 3: Visualizations of high-frequency $vs.$ low-frequency information. We decompose the original features extracted from SD 3.5 into components containing exclusively high-frequency and low-frequency information. The second column shows the original features extracted from the model. As seen, the high-frequency features contain more discriminative information (e.g., edge, texture, and corner information from the black footed albatross is clearly preserved) than their low-frequency counterparts. Inspired by this, we design HFR to assess the significance of high-frequency information (see §\ref{sec:HFR}).
  • Figure 4: Comparison of HFR and Fisher Score across timesteps on Oxford Flowers (top) and CUB (bottom). The two show strong alignment, indicating that HFR captures discriminative characteristics consistent with the Fisher Score and serves as a reliable label-free indicator of feature separability.
  • Figure 5: Impact of Feature and Block Selection. We present accuracy across the features $Q$, $K$, $V$, $A$, and $O$ extracted from different transformer blocks on Oxford Flowers. $Q$ and $V$ achieve the highest accuracies (90.6% and 88.7%, respectively), while $A$, $O$, and $K$ show comparatively lower performance. The middle transformer blocks yield the most discriminative representations, highlighting the importance of both feature and block selection for optimal performance. Additional experimental results on other datasets are provided in the Appendix § S7.
  • ...and 8 more figures
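The captions above describe the core mechanism: for each candidate timestep, extract the query feature $Q_t$, measure how much of its spectral energy lies in high frequencies (the HFR), and select the timestep with the highest average HFR. The sketch below illustrates this idea with a simple FFT-based high-frequency energy ratio; the paper's exact HFR formula and the radial `cutoff` value are not given in this excerpt, so both are illustrative assumptions.

```python
import numpy as np

def high_frequency_ratio(feature, cutoff=0.25):
    """Illustrative HFR: fraction of 2-D spectral energy beyond a radial
    frequency cutoff (the paper's exact definition may differ)."""
    spec = np.fft.fftshift(np.fft.fft2(feature))  # spectrum with DC centered
    h, w = feature.shape
    yy, xx = np.mgrid[:h, :w]
    # normalized radial distance of each frequency bin from the center
    r = np.sqrt(((yy - h / 2) / h) ** 2 + ((xx - w / 2) / w) ** 2)
    energy = np.abs(spec) ** 2
    return energy[r > cutoff].sum() / energy.sum()

def select_timestep(features_per_t):
    """Pick the timestep whose query features have the highest mean HFR,
    mirroring the selection rule described in Figure 2."""
    scores = {t: np.mean([high_frequency_ratio(f) for f in feats])
              for t, feats in features_per_t.items()}
    return max(scores, key=scores.get)
```

On a smooth (low-frequency) feature map this ratio is near zero, while a texture-rich map scores much higher, so `select_timestep` favors timesteps whose features preserve edge and texture detail, consistent with the observation in Figure 3 that high-frequency components carry the discriminative information.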