TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation

Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Fu-Yun Wang, Yuchi Wang, Renrui Zhang, Peng Gao, Hongsheng Li

TL;DR

TIDE introduces Temporal-Aware Sparse Autoencoders to extract interpretable, sparse activations from Diffusion Transformers across diffusion timesteps, revealing that DiTs organize hierarchical 3D, semantic, and class-level features during large-scale pretraining. By training SAEs on DiT activations and adding timestep-dependent modulation, TIDE achieves improved reconstruction and interpretability with minimal sacrifice to generation quality. The approach demonstrates robustness across backbones and enables practical applications such as safe image editing and style transfer, supported by ablations and safety evaluations. Overall, TIDE provides a foundation for trustworthy, controllable diffusion-based generation by making internal representations transparent and manipulable.

Abstract

Diffusion Transformers (DiTs) are a powerful yet underexplored class of generative models compared to U-Net-based diffusion architectures. We propose TIDE (Temporal-aware sparse autoencoders for Interpretable Diffusion transformErs), a framework designed to extract sparse, interpretable activation features across timesteps in DiTs. TIDE effectively captures temporally varying representations and reveals that DiTs naturally learn hierarchical semantics (e.g., 3D structure, object class, and fine-grained concepts) during large-scale pretraining. Experiments show that TIDE enhances interpretability and controllability while maintaining reasonable generation quality, enabling applications such as safe image editing and style transfer.

Paper Structure

This paper contains 24 sections, 5 equations, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: TIDE effectively encodes interpretable features at different levels (class, semantic, and 3D) in a pre-trained diffusion transformer (chen2023pixart). This validates that the diffusion model inherently captures and organizes these multi-level concepts through large-scale generative pre-training, enabling it to perform various downstream diffusion tasks effectively (see Sec. "Diffusion Really Learned Features" for details).
  • Figure 2: Overview of our training process: For each DiT, we train its TIDE (an SAE embedded within the temporal-aware (TA) architecture) individually, extracting layer activations from different timesteps each time. The SAE is trained both with and without random sampling. For details of the loss design, refer to the supporting material.
  • Figure 3: TIDE integrates timestep-dependent modulation into the original SAE architecture, achieving significantly faster convergence and enhanced performance.
  • Figure 4: (a)(b) Scaling laws of convergence loss with the number of latents n held fixed, under MSE and cosine similarity loss ($L_{cos} = 1 - S_{cos}$). (c)(d) Comparison of TIDE with other activation functions: for 73,728-d latents, TIDE achieves better trade-offs in diffusion loss against both cosine similarity and MSE. Based on these experiments, we selected top-k values in the range of 1024 to 4096 and set the latent dimension to 16d = 73,728 for the subsequent image editing tasks.
  • Figure 5: By manipulating TIDE's latent space, we can achieve various concept transformations, such as erasing Rococo style, increasing age, and altering the orientation and shape of architecture. The modification becomes stronger as the number of altered tokens increases.
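
The captions for Figures 3 and 4 mention three ingredients: top-k sparsity, timestep-dependent modulation, and the two reconstruction losses (MSE and $L_{cos} = 1 - S_{cos}$). The sketch below shows one way these could fit together. It is not the authors' implementation: the class name `TemporalTopKSAE`, the per-timestep scale-and-shift form of the modulation, the ReLU after top-k, and the random initialization are all assumptions made for illustration, and the dimensions are toy-sized.

```python
import math
import random


def top_k(values, k):
    """Keep the k largest entries (clamped at zero) and zero out the rest."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    keep = set(order[:k])
    return [max(v, 0.0) if i in keep else 0.0 for i, v in enumerate(values)]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def mse_loss(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_loss(x, x_hat, eps=1e-12):
    """L_cos = 1 - S_cos, the cosine-similarity loss from the Figure 4 caption."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
    return 1.0 - dot / (norm + eps)


class TemporalTopKSAE:
    """Top-k sparse autoencoder with a hypothetical per-timestep scale/shift."""

    def __init__(self, d, n_latents, k, n_timesteps, seed=0):
        rng = random.Random(seed)
        std = 1.0 / math.sqrt(d)
        self.k = k
        self.w_enc = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(n_latents)]
        self.w_dec = [[rng.gauss(0.0, std) for _ in range(n_latents)] for _ in range(d)]
        # Assumed modulation: one scale and one shift vector per timestep,
        # initialised to the identity so the model starts as a plain SAE.
        self.scale = [[1.0] * d for _ in range(n_timesteps)]
        self.shift = [[0.0] * d for _ in range(n_timesteps)]

    def encode(self, x, t):
        """Modulate the activation for timestep t, project, then sparsify."""
        mod = [s * xi + b for s, xi, b in zip(self.scale[t], x, self.shift[t])]
        return top_k(matvec(self.w_enc, mod), self.k)

    def decode(self, z, t):
        """Project back to activation space and undo the timestep modulation."""
        y = matvec(self.w_dec, z)
        return [(yi - b) / s for yi, s, b in zip(y, self.scale[t], self.shift[t])]


# Toy usage: an 8-d activation, 32 latents, at most 4 active latents.
sae = TemporalTopKSAE(d=8, n_latents=32, k=4, n_timesteps=3)
x = [float(i + 1) for i in range(8)]
z = sae.encode(x, t=1)
x_hat = sae.decode(z, t=1)
print(sum(1 for v in z if v != 0.0), mse_loss(x, x_hat), cosine_loss(x, x_hat))
```

In a trained model the encoder, decoder, scale, and shift would all be learned by minimizing one of the two losses. For scale, the setting 16d = 73,728 quoted in the Figure 4 caption corresponds to d = 4,608, with top-k between 1024 and 4096 active latents.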