FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld

TL;DR

FlexiDiT tackles the high compute cost of diffusion transformers by making the compute spent per denoising step dynamic. It introduces a lightweight, patch-size-based adaptation that lets a single pre-trained DiT process the same input as sequences with different token counts, achieved via shared parameters or low-rank adapters and a minimal set of architectural adjustments. An inference scheduler that uses a weak model for early denoising steps and a powerful model for later steps reduces compute by more than $40\%$ for images and up to $75\%$ for video, with negligible loss in quality across class-conditioned, text-conditioned, and video generation tasks. The approach is modality-agnostic, which makes it applicable to multimodal diffusion systems and offers a practical path to scalable, high-quality generation.

Abstract

Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones, dubbed FlexiDiT, allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than $40\%$ compared to its static counterpart, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended to video generation, where FlexiDiT models generate samples with up to $75\%$ less compute without compromising performance.
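To make the dynamic-compute idea concrete, here is a minimal sketch of how the cost of a weak-then-powerful schedule could be estimated. It is not the authors' implementation. It assumes that per-step cost is proportional to the number of tokens (the quadratic attention term and the embedding layers are ignored), and the function names, the 32x32 latent grid, the patch sizes 2 and 4, and the 50-step schedule are all illustrative choices rather than values taken from the paper.

```python
def tokens_per_image(height, width, patch_size):
    """Number of tokens when an H x W latent is split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both height and width")
    return (height // patch_size) * (width // patch_size)


def relative_compute(num_steps, weak_steps, weak_patch, strong_patch,
                     height=32, width=32):
    """Cost of a weak-then-powerful schedule relative to an all-powerful one.

    Assumption: each denoising step costs an amount proportional to its
    token count. Real transformer cost also has a term quadratic in the
    token count, which this estimate leaves out.
    """
    if not 0 <= weak_steps <= num_steps:
        raise ValueError("weak_steps must lie between 0 and num_steps")
    weak = tokens_per_image(height, width, weak_patch)
    strong = tokens_per_image(height, width, strong_patch)
    # Early steps use the weak (large-patch) model, later steps the powerful one.
    dynamic = weak_steps * weak + (num_steps - weak_steps) * strong
    static = num_steps * strong
    return dynamic / static


# Hypothetical schedule: 25 of 50 steps with patch size 4, the rest with patch size 2.
print(tokens_per_image(32, 32, 2))     # 256
print(tokens_per_image(32, 32, 4))     # 64
print(relative_compute(50, 25, 4, 2))  # 0.625
```

Under these assumptions, running the first half of the steps with the larger patch size costs 62.5% of the static budget. The actual savings depend on the model's true per-step cost and on how many early steps can use the weak model without hurting quality.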

Paper Structure

This paper contains 51 sections, 6 equations, 45 figures, 5 tables.

Figures (45)

  • Figure 1: We flexify DiTs and adjust the compute per diffusion step, generating high-quality samples with significantly less compute.
  • Figure 2: Diffusion can be viewed as spectral autoregression [dieleman2024spectral]. Left: Diffusion and its effect on the spatial frequency of images. Right: To investigate the role of different frequency components in image generation, we apply a low or high pass filter to a single diffusion step update (while keeping all other updates unchanged). With all other sources of randomness fixed, we compare the generated samples with and without filtering using LPIPS [zhang2018unreasonable], $L_2$ distance of the pixels, SSIM [wang2004image] and DreamSim [fu2023dreamsim]. Notably, the influence of low and high pass filters varies depending on whether they are applied early or late in the denoising process.
  • Figure 3: Tokenizing images into patches.
  • Figure 4: Left: We flexify DiTs by allowing them to process images with more patch sizes, by changing the lightweight embedding and de-embedding layers. We showcase this for a class-conditioned ImageNet model. Right: We plot the difference in predictions between a weak and a powerful model. For the first denoising steps, differences are small, and thus using the weak model there allows accelerated generation without performance degradation.
  • Figure 5: We preserve the functional form of the target model for the pre-trained patch size and add new trainable parameters (LoRAs) for each additional patch size we want to fine-tune the model to operate with. We showcase this for a text-to-image/video model that uses cross-attention for text conditioning. We find that freezing cross-attention layers without any additional LoRAs works best. During inference, we can either keep the LoRAs unmerged (Inference with LoRAs), leading to a slight FLOPs increase that depends on the LoRAs' dimensions, or create a separate copy of the model for each patch size by merging the LoRAs (Inference without LoRAs). The latter leads to additional memory requirements. FLOPs and parameter numbers on the right correspond to our flexible T2I Emu model.
  • ...and 40 more figures
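
The patch tokenization that Figures 3 and 4 refer to can be illustrated with a small sketch. This is a generic patchify routine on a toy 4x4 grid, not code from the paper. The function name `patchify` and the toy input are hypothetical, and a real DiT would apply a learned linear embedding to each flattened patch afterwards.

```python
def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened patches in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("patch_size must divide both height and width")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single token.
            tokens.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return tokens


# Toy 4x4 "image" whose entries are 0..15 in row-major order.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

print(len(patchify(image, 1)))  # 16
print(len(patchify(image, 2)))  # 4
print(patchify(image, 2)[0])    # [0, 1, 4, 5]
```

Doubling the patch size cuts the token count by a factor of four while each token carries four times as many values, which is why only the embedding and de-embedding layers need to change shape when a new patch size is added.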