Pyramidal Patchification Flow for Visual Generation

Hui Li, Baoyou Chen, Liwei Zhang, Jiaye Li, Jingdong Wang, Siyu Zhu

TL;DR

This work introduces Pyramidal Patchification Flow, an approach that operates over full latent representations rather than pyramid representations and adopts the normal denoising process without requiring the renoising trick.

Abstract

Diffusion transformers (DiTs) adopt Patchify, which maps patch representations to token representations through linear projections, to adjust the number of tokens fed to the DiT blocks and thus the computation cost. Instead of using a single patch size for all timesteps, we introduce a Pyramidal Patchification Flow (PPFlow) approach: large patch sizes are used for high-noise timesteps and small patch sizes for low-noise timesteps; linear projections are learned for each patch size; and Unpatchify is modified accordingly. Unlike Pyramidal Flow, our approach operates over full latent representations rather than pyramid representations, and adopts the normal denoising process without requiring the renoising trick. We demonstrate the effectiveness of our approach under two training regimes. Training from scratch achieves a $1.6\times$ ($2.0\times$) inference speedup over SiT-B/2 for 2-level (3-level) pyramid patchification, with slightly lower training FLOPs and similar image generation performance. Training from pretrained normal DiTs achieves even better performance with little training time. The code and checkpoint are available at https://github.com/fudan-generative-vision/PPFlow.
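
To make the cost argument concrete, the sketch below counts tokens per denoising step for different patch sizes. It is only an illustration: the 32×32 latent grid, the patch sizes 4 and 2, and the even split of timesteps between levels are assumptions, and the helper names `num_tokens` and `average_tokens` are hypothetical rather than taken from the PPFlow code. It counts tokens only and does not measure inference speed.

```python
def num_tokens(latent_hw: int, patch_size: int) -> int:
    """Number of tokens for a square latent of side latent_hw."""
    if latent_hw % patch_size != 0:
        raise ValueError("patch size must divide the latent side length")
    return (latent_hw // patch_size) ** 2


def average_tokens(latent_hw: int, schedule: list[tuple[float, int]]) -> float:
    """Average token count per step.

    schedule is a list of (fraction_of_steps, patch_size) pairs whose
    fractions sum to 1. The fractions are an assumed split, not values
    reported in the source.
    """
    total = sum(fraction for fraction, _ in schedule)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * num_tokens(latent_hw, p) for fraction, p in schedule)


# Assumed setting: 32x32 latent grid.
single_level = num_tokens(32, 2)                       # 256 tokens at every step
two_level = average_tokens(32, [(0.5, 4), (0.5, 2)])   # 160.0 tokens on average
print(single_level, two_level)
```

Because doubling the patch side length cuts the token count by a factor of four, even a partial use of large patches at high-noise timesteps noticeably reduces the average number of tokens the DiT blocks have to process.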

Paper Structure

This paper contains 16 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Pyramidal Patchification Flow (PPFlow) achieves state-of-the-art image generation quality with accelerated denoising processes. (a) and (b) show visual samples from our two class-conditional models, PPF-XL-2 and PPF-XL-3, trained on ImageNet. (c) indicates that PPF-XL-2 and PPF-XL-3 obtain $1.6\times$ and $2.0\times$ inference acceleration with comparable FID scores.
  • Figure 2: Conceptual comparison. (a) A three-level PPFlow example. The patch sizes in $\operatorname{Patchify}$ are larger for higher-noise timesteps and smaller for lower-noise timesteps. The representation resolutions of all three levels are the same and full. (b) Pyramidal Flow (Jin et al., 2024), illustrated for image generation. It operates over pyramid representations: smaller representation resolution for higher noise and larger representation resolution for lower noise.
  • Figure 3: $\operatorname{Patchify}$. After flattening a noisy latent, the layer maps each $p_s \times p_s$ patch representation into a $d$-dimensional token representation through a linear projection $\mathbf{W}_s \in \mathbb{R}^{d \times d_s}$ ($d_s = Cp_s^2$). $\operatorname{Unpatchify}$ is the reverse process, mapping the token representations output by the DiT blocks to the predictions. For example, for velocity predictions, the linear projection matrix is of size $d_s \times d$: $\mathbf{W}^u_s \in \mathbb{R}^{d_s \times d}$.
  • Figure 4: Comparison of our approach with FlexiDiT, based on DiT-XL/2 at 256 resolution. At around 63% of the FLOPs budget, PPFlow achieves an FID of 2.15, outperforming FlexiDiT's 2.25. At around 50% of the FLOPs budget, PPFlow's FID of 2.31 is substantially better than FlexiDiT's 2.64.
  • Figure 5: Visualization results for normal SiT-B/2, PPF-B-2, and PPF-B-3. The results are sampled from the same noise. The models of our approach are trained from scratch. The results of the three methods are visually comparable.
  • ...and 4 more figures
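
The Patchify and Unpatchify layers described in the Figure 3 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `PyramidPatchEmbed`, the helper functions, the random weights, the omission of bias terms, and the single noise threshold of 0.5 are all hypothetical. Only the shapes $\mathbf{W}_s \in \mathbb{R}^{d \times d_s}$ and $\mathbf{W}^u_s \in \mathbb{R}^{d_s \times d}$ with $d_s = Cp_s^2$ come from the caption.

```python
import numpy as np


def to_patches(latent: np.ndarray, p: int) -> np.ndarray:
    """Split a (C, H, W) latent into (N, C*p*p) flattened p x p patches."""
    C, H, W = latent.shape
    assert H % p == 0 and W % p == 0, "patch size must divide H and W"
    x = latent.reshape(C, H // p, p, W // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((H // p) * (W // p), C * p * p)


def from_patches(patches: np.ndarray, C: int, H: int, W: int, p: int) -> np.ndarray:
    """Inverse of to_patches: (N, C*p*p) back to a (C, H, W) array."""
    x = patches.reshape(H // p, W // p, C, p, p)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(C, H, W)


class PyramidPatchEmbed:
    """One pair of linear projections per patch size (hypothetical class)."""

    def __init__(self, C: int, d: int, patch_sizes: tuple[int, ...], seed: int = 0):
        rng = np.random.default_rng(seed)
        self.C = C
        # W_s has shape (d, d_s) and W^u_s has shape (d_s, d), with d_s = C * p^2.
        # Random values stand in for learned weights.
        self.W_in = {p: 0.02 * rng.standard_normal((d, C * p * p)) for p in patch_sizes}
        self.W_out = {p: 0.02 * rng.standard_normal((C * p * p, d)) for p in patch_sizes}

    def patchify(self, latent: np.ndarray, p: int) -> np.ndarray:
        """Map a full-resolution latent to (N, d) tokens for patch size p."""
        return to_patches(latent, p) @ self.W_in[p].T

    def unpatchify(self, tokens: np.ndarray, p: int, H: int, W: int) -> np.ndarray:
        """Map (N, d) tokens back to a (C, H, W) prediction."""
        patches = tokens @ self.W_out[p].T  # (N, d_s)
        return from_patches(patches, self.C, H, W, p)


def patch_size_for(noise_level: float, thresholds=(0.5,), sizes=(2, 4)) -> int:
    """Pick a larger patch size for higher noise.

    noise_level is in [0, 1], with 1 meaning pure noise; both this convention
    and the threshold values are assumptions made for illustration.
    """
    level = sum(noise_level >= th for th in thresholds)
    return sizes[level]


embed = PyramidPatchEmbed(C=4, d=8, patch_sizes=(2, 4))
latent = np.random.default_rng(1).standard_normal((4, 32, 32))

for noise_level in (0.9, 0.1):
    p = patch_size_for(noise_level)
    tokens = embed.patchify(latent, p)             # fewer tokens when p is larger
    prediction = embed.unpatchify(tokens, p, 32, 32)
    print(noise_level, p, tokens.shape, prediction.shape)
```

In this sketch the latent keeps its full 32×32 resolution at every level; only the patch size, and therefore the number of tokens passed to the transformer blocks, changes with the noise level. In a real model the tokens would pass through the DiT blocks between `patchify` and `unpatchify`.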