Designing Parameter and Compute Efficient Diffusion Transformers using Distillation

Vignesh Sundaresha

TL;DR

This work addresses the challenge of deploying diffusion transformers (DiTs) on edge devices by applying knowledge distillation to create parameter- and compute-efficient DiTs (DiT-Nano). It develops design principles for sizing DiTs (depth, width, and heads) and introduces two distillation schemes, Teaching Assistant (TA) and Multi-In-One (MI1), with a practical emphasis on one-step diffusion and offline teacher signals. Empirical results on CIFAR-10 demonstrate that LPIPS-based GET distillation yields strong performance, with a favorable trade-off between model size, image quality (FID), and latency on edge hardware, outperforming a SOTA diffusion-distillation baseline in several metrics. The findings offer actionable guidelines for edge-ready diffusion models and point to future work on analytic justifications and broader design-space exploration for real-world applications.

Abstract

Diffusion Transformers (DiTs) with billions of model parameters form the backbone of popular image and video generation models like DALL·E, Stable Diffusion and Sora. Though these models are needed in many low-latency applications like Augmented/Virtual Reality, they cannot be deployed on resource-constrained Edge devices (like Apple Vision Pro or Meta Ray-Ban glasses) due to their huge computational complexity. To overcome this, we turn to knowledge distillation and perform a thorough design-space exploration to achieve the best DiT for a given parameter size. In particular, we provide principles for how to choose design knobs such as depth, width, attention heads and distillation setup for a DiT. During the process, a three-way trade-off emerges between model performance, size and speed that is crucial for Edge implementation of diffusion. We also propose two distillation approaches, the Teaching Assistant (TA) method and the Multi-In-One (MI1) method, to perform feature distillation in the DiT context. Unlike existing solutions, we demonstrate and benchmark the efficacy of our approaches on practical Edge devices such as NVIDIA Jetson Orin Nano.
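
To make the depth, width and heads knobs concrete, the following is a minimal back-of-the-envelope sketch of how they relate to parameter count in a generic transformer backbone. It is not the sizing rule from the paper: the function names are hypothetical, the MLP expansion ratio of 4 is an assumed default, and the estimate ignores biases, normalization layers, embeddings, the output layer, and any DiT-specific conditioning parameters. It illustrates that, in a standard block, weights scale linearly with depth and quadratically with width, while the number of attention heads only changes the per-head dimension.

```python
def block_params(width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard transformer block.

    Assumption: biases, norms, and conditioning layers are ignored.
    """
    attention = 4 * width * width        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * width * width  # two linear layers of the MLP
    return attention + mlp


def backbone_params(depth: int, width: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of `depth` stacked blocks."""
    return depth * block_params(width, mlp_ratio)


def head_dim(width: int, heads: int) -> int:
    """Per-head dimension; heads split the width and add no weights."""
    if width % heads != 0:
        raise ValueError("width must be divisible by the number of heads")
    return width // heads


if __name__ == "__main__":
    # Hypothetical configurations, chosen only for illustration.
    for depth, width, heads in [(12, 384, 6), (6, 192, 3)]:
        total = backbone_params(depth, width)
        print(f"depth={depth} width={width} heads={heads} "
              f"head_dim={head_dim(width, heads)} params~{total:,}")
```

Under these assumptions, halving both depth and width cuts the backbone weights by roughly a factor of eight, which is one reason depth and width are natural first knobs when targeting a fixed parameter budget.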
