Designing Parameter and Compute Efficient Diffusion Transformers using Distillation
Vignesh Sundaresha
TL;DR
This work addresses the challenge of deploying diffusion transformers (DiTs) on edge devices by applying knowledge distillation to create parameter- and compute-efficient DiTs (DiT-Nano). It develops design principles for sizing DiTs (depth, width, and attention heads) and introduces two distillation schemes, Teaching Assistant (TA) and Multi-In-One (MI1), with a practical emphasis on one-step diffusion and offline teacher signals. Empirical results on CIFAR-10 show that LPIPS-based GET distillation performs strongly, with a favorable trade-off between model size, image quality (FID), and latency on edge hardware, and that it outperforms a state-of-the-art diffusion-distillation baseline on several metrics. The findings offer actionable guidelines for edge-ready diffusion models and point to future work on analytic justifications and broader design-space exploration for real-world applications.
Abstract
Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models such as DALL·E, Stable Diffusion and Sora. Though these models are needed in many low-latency applications such as Augmented/Virtual Reality, they cannot be deployed on resource-constrained edge devices (such as the Apple Vision Pro or Meta Ray-Ban glasses) because of their huge computational complexity. To overcome this, we turn to knowledge distillation and perform a thorough design-space exploration to find the best DiT for a given parameter count. In particular, we provide principles for choosing design knobs such as depth, width, attention heads and distillation setup for a DiT. In the process, a three-way trade-off emerges between model performance, size and speed that is crucial for edge implementations of diffusion. We also propose two distillation approaches, the Teaching Assistant (TA) method and the Multi-In-One (MI1) method, to perform feature distillation in the DiT context. Unlike existing solutions, we demonstrate and benchmark the efficacy of our approaches on practical edge devices such as the NVIDIA Jetson Orin Nano.
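
To make the depth/width/heads knobs concrete, the following is a rough sketch of how one might enumerate candidate DiT shapes under a parameter budget. It is an illustration, not the sizing procedure used in this work. It assumes a standard adaLN-Zero DiT block (self-attention, an MLP with expansion ratio 4, and a 6-way modulation layer) and ignores patch embedding, timestep/label embedders and the final layer. The function names `dit_block_params`, `dit_backbone_params` and `configs_under_budget` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dit_block_params(width: int, mlp_ratio: int = 4) -> int:
    """Approximate parameter count of one adaLN-Zero DiT block (assumed layout)."""
    d = width
    # Attention: fused QKV projection (3d^2 + 3d) plus output projection (d^2 + d).
    attn = 4 * d * d + 4 * d
    # MLP: two linear layers d -> hidden -> d, with biases.
    hidden = mlp_ratio * d
    mlp = 2 * d * hidden + hidden + d
    # adaLN modulation: one linear layer d -> 6d producing shift/scale/gate vectors.
    adaln = 6 * d * d + 6 * d
    return attn + mlp + adaln


def dit_backbone_params(depth: int, width: int, mlp_ratio: int = 4) -> int:
    """Parameters in the stack of transformer blocks only (embedders excluded)."""
    return depth * dit_block_params(width, mlp_ratio)


def configs_under_budget(
    budget: int,
    depths: Sequence[int],
    widths: Sequence[int],
    heads_options: Sequence[int],
) -> List[Tuple[int, int, int, int]]:
    """List (depth, width, heads, params) tuples whose backbone fits the budget."""
    results = []
    for depth in depths:
        for width in widths:
            n_params = dit_backbone_params(depth, width)
            if n_params > budget:
                continue
            for heads in heads_options:
                # Width must split evenly across heads; in standard multi-head
                # attention the head count does not change the parameter count.
                if width % heads == 0:
                    results.append((depth, width, heads, n_params))
    return results


if __name__ == "__main__":
    for cfg in configs_under_budget(10_000_000, [4, 6, 8, 12], [192, 256, 384], [3, 4, 6, 8]):
        print(cfg)
```

Under these assumptions the backbone grows linearly with depth and roughly quadratically with width, while the number of heads only changes the per-head dimension. A sketch like this can narrow the candidate set, but it does not capture latency or image quality, which still have to be measured on the target device.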
