Dynamic Chunking Diffusion Transformer

Akash Haridas, Utkarsh Saxena, Parsa Ashrafi Fashi, Mehdi Rezagholizadeh, Vikram Appia, Emad Barsoum

TL;DR

The Dynamic Chunking Diffusion Transformer (DC-DiT) augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter, data-dependent token sequence, using a chunking mechanism learned end-to-end with diffusion training.

Abstract

Diffusion Transformers process images as fixed-length sequences of tokens produced by a static $\textit{patchify}$ operation. While effective, this design spends uniform compute on low- and high-information regions alike, ignoring that images contain regions of varying detail and that the denoising process progresses from coarse structure at early timesteps to fine detail at late timesteps. We introduce the Dynamic Chunking Diffusion Transformer (DC-DiT), which augments the DiT backbone with a learned encoder-router-decoder scaffold that adaptively compresses the 2D input into a shorter token sequence in a data-dependent manner using a chunking mechanism learned end-to-end with diffusion training. The mechanism learns to compress uniform background regions into fewer tokens and detail-rich regions into more tokens, with meaningful visual segmentations emerging without explicit supervision. It also learns to adapt its compression across diffusion timesteps, using fewer tokens at noisy stages and more tokens as fine details emerge. On class-conditional ImageNet $256{\times}256$, DC-DiT consistently improves FID and Inception Score over both parameter-matched and FLOP-matched DiT baselines across $4{\times}$ and $16{\times}$ compression, showing this is a promising technique with potential further applications to pixel-space, video, and 3D generation. Beyond generation quality, DC-DiT is practical: it can be upcycled from pretrained DiT checkpoints with minimal post-training compute (up to $8{\times}$ fewer training steps) and composes with other dynamic computation methods to further reduce generation FLOPs.

Paper Structure

This paper contains 28 sections, 3 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Architecture of DC-DiT. The isotropic encoder aggregates local context across the input tokens. The chunking layer selects a subset of boundary tokens via a learned routing module, yielding a compressed sequence that is processed by the DiT blocks. The de-chunking layer restores the original resolution through spatial smoothing followed by plug-back.
  • Figure 2: Boundary predictions shown next to sample images from the XL-scale DC-DiT at $N{=}4$ (top) and $N{=}16$ (bottom). Boundary tokens (retained) concentrate on object edges and textured regions, while non-boundary tokens (dropped) cluster in uniform backgrounds. The chunking mechanism discovers these visual segmentations without any explicit supervision, solely from being trained with the diffusion objective.
  • Figure 3: FID-50K as a function of training steps across model scales and compression ratios. DC-DiT reaches scores similar to those of the parameter-matched (isoparam) baselines with 25-50% fewer training steps. At XL scale with ${\sim}4{\times}$ compression, DC-DiT starts with a higher FID but converges faster, surpassing both baselines by the 400K-step mark.
  • Figure 4: Compression ratio and inference throughput as a function of diffusion timestep for the XL-scale DC-DiT. At early (noisy) timesteps the router retains fewer boundary tokens, yielding higher compression and faster throughput. As denoising progresses and fine details emerge, the router retains more tokens. This schedule emerges entirely from end-to-end training without any explicit timestep-dependent supervision.
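
The chunk, process, and de-chunk flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the tokens are treated as a flattened 1D sequence, the router is a hypothetical cosine-dissimilarity rule standing in for the learned routing module, the DiT blocks are replaced by an identity function, and the smoothing step is a simple running average weighted by the router's probabilities. All function names are illustrative.

```python
import numpy as np

def route(tokens, threshold=0.5, eps=1e-8):
    """Hypothetical router (an assumption, not the paper's learned module):
    a token is a boundary when it differs enough from its predecessor."""
    unit = tokens / (np.linalg.norm(tokens, axis=-1, keepdims=True) + eps)
    cos = np.sum(unit[1:] * unit[:-1], axis=-1)
    # The first token is always kept; the rest get 0.5 * (1 - cosine similarity).
    prob = np.concatenate([[1.0], 0.5 * (1.0 - cos)])
    return prob, prob >= threshold

def chunk(tokens, mask):
    """Keep only the boundary tokens, giving the compressed sequence."""
    return tokens[mask]

def dechunk(processed, prob, mask):
    """Smooth the compressed sequence, then plug each result back into
    every position that belongs to its chunk (assumed smoothing rule)."""
    p = prob[mask]
    smoothed = np.empty_like(processed)
    smoothed[0] = processed[0]
    for i in range(1, len(processed)):
        smoothed[i] = p[i] * processed[i] + (1.0 - p[i]) * smoothed[i - 1]
    chunk_id = np.cumsum(mask) - 1  # index of the most recent boundary token
    return smoothed[chunk_id]

# Toy input: three uniform "regions" of 3, 2, and 3 identical tokens.
a, b, c = np.eye(4)[0], np.eye(4)[1], np.eye(4)[2]
tokens = np.stack([a, a, a, b, b, c, c, c])

prob, mask = route(tokens)
compressed = chunk(tokens, mask)
processed = compressed  # identity stand-in for the DiT blocks
restored = dechunk(processed, prob, mask)

print("kept tokens:", int(mask.sum()), "of", len(tokens))
print("compression ratio:", len(tokens) / int(mask.sum()))
```

In this toy input, runs of identical tokens collapse to a single boundary token each, which mirrors the caption's observation that uniform regions are dropped while transitions are retained. A learned router would replace the fixed threshold rule, and the compression ratio would then vary with image content and diffusion timestep.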