FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute
Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld
TL;DR
FlexiDiT tackles the high compute cost of diffusion transformers by making the compute spent per denoising step dynamic. It introduces a lightweight, patch-size-based adaptation that lets a single pre-trained DiT process the same input as sequences with different token counts, achieved via shared parameters or low-rank adapters and a minimal set of architectural adjustments. An inference scheduler that employs a weak model for early denoising steps and a powerful model for later steps yields compute reductions of over $40\%$ for images and up to $75\%$ for video, with negligible loss in quality across class-conditioned, text-conditioned, and video generation tasks. The approach is modality-agnostic, enabling broad applicability to multimodal diffusion systems and offering a practical path to scalable, high-quality generation.
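To make the scheduling idea concrete, the following Python sketch estimates the relative compute of a sampling run that uses a larger patch size (fewer tokens) for the early steps and the original patch size for the remaining steps. It is a back-of-the-envelope illustration, not the authors' implementation: the function names, the $32 \times 32$ latent, the patch sizes, and the step counts are all hypothetical, and per-step cost is assumed to scale linearly with the number of tokens. Since self-attention scales quadratically in sequence length, the true saving for such a schedule would likely be somewhat larger.

```python
def num_tokens(latent_size: int, patch_size: int) -> int:
    """Number of tokens for a square latent split into square patches."""
    if latent_size % patch_size != 0:
        raise ValueError("latent_size must be divisible by patch_size")
    return (latent_size // patch_size) ** 2


def relative_compute(total_steps: int, weak_steps: int, latent_size: int,
                     base_patch: int, weak_patch: int) -> float:
    """Compute of a mixed schedule relative to an all-base-patch schedule.

    Assumption (hypothetical cost model): the cost of one denoising step
    is proportional to the number of tokens processed in that step.
    """
    if not 0 <= weak_steps <= total_steps:
        raise ValueError("weak_steps must lie in [0, total_steps]")
    base = num_tokens(latent_size, base_patch)
    weak = num_tokens(latent_size, weak_patch)
    # Early steps run in the weak (large-patch) mode, the rest in the base mode.
    mixed_cost = weak_steps * weak + (total_steps - weak_steps) * base
    return mixed_cost / (total_steps * base)


# Hypothetical setting: 32x32 latent, base patch size 2, weak patch size 4,
# 50 sampling steps of which the first 30 use the weak mode.
ratio = relative_compute(total_steps=50, weak_steps=30,
                         latent_size=32, base_patch=2, weak_patch=4)
print(f"relative compute: {ratio:.2f}")
```

Under these illustrative numbers the base mode processes 256 tokens per step and the weak mode 64, so the mixed schedule costs $(30 \cdot 64 + 20 \cdot 256) / (50 \cdot 256) = 0.55$ of the static baseline.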
Abstract
Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into \emph{flexible} ones -- dubbed FlexiDiT -- allowing them to process inputs at varying compute budgets. We demonstrate how a single \emph{flexible} model can generate images without any drop in quality, while reducing the required FLOPs by more than $40$\% compared to its static counterpart, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended to video generation, where FlexiDiT models generate samples with up to $75$\% less compute without compromising performance.
