Multi-student Diffusion Distillation for Better One-step Generators

Yanke Song, Jonathan Lorraine, Weili Nie, Karsten Kreis, James Lucas

TL;DR

This work tackles the bottleneck of slow diffusion sampling by introducing Multi-Student Distillation (MSD), a framework that distills a conditional teacher into multiple single-step generators, each responsible for a subset of the conditioning inputs. Students can be the same size as the teacher or smaller. Because only one student is retrieved per input condition at inference, adding students raises effective capacity without increasing inference latency, while smaller students reduce it. MSD can be combined with distribution matching (DM) and adversarial distribution matching (ADM). For smaller students, which cannot be initialized with teacher weights, MSD adds a teacher score matching (TSM) pretraining stage before the DM and ADM stages, giving a three-stage training scheme. Empirically, four same-sized MSD students surpass single-student baselines in one-step generation on ImageNet-64x64 and zero-shot COCO2014 (FID of 1.20 and 8.20, respectively), while four smaller students deliver faster inference with competitive quality (e.g., FID as low as 2.88 with 42% fewer parameters). Overall, MSD extends the speed-quality frontier for one-step diffusion generation, making real-time generation more practical for computationally heavy applications and suggesting deployment strategies for large-scale, multi-user environments.

Abstract

Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model's inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of the conditioning data, thereby obtaining higher generation quality for the same capacity. MSD trains multiple distilled students, allowing smaller sizes and, therefore, faster inference. Also, MSD offers a lightweight quality boost over single-student distillation with the same architecture. We demonstrate MSD is effective by training multiple same-sized or smaller students on single-step distillation using distribution matching and adversarial distillation techniques. With smaller students, MSD gets competitive results with faster inference for single-step generation. Using 4 same-sized students, MSD significantly outperforms single-student baseline counterparts and achieves remarkable FID scores for one-step image generation: 1.20 on ImageNet-64x64 and 8.20 on zero-shot COCO2014.
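
The partition-and-route idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes class-conditional generation with integer labels and splits the labels into contiguous, near-equal blocks, which may differ from the partitioning the paper actually uses. The names `partition_labels`, `build_router`, `filter_for_student`, and `generate` are hypothetical, and each student is stood in for by a plain callable taking `(noise, label)`.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple


def partition_labels(num_classes: int, num_students: int) -> List[range]:
    """Split labels 0..num_classes-1 into contiguous, near-equal blocks.

    Assumption: contiguous blocks are used purely for illustration.
    """
    base, extra = divmod(num_classes, num_students)
    blocks, start = [], 0
    for k in range(num_students):
        size = base + (1 if k < extra else 0)  # spread the remainder
        blocks.append(range(start, start + size))
        start += size
    return blocks


def build_router(blocks: Sequence[range]) -> Dict[int, int]:
    """Map every label to the index of the student responsible for it."""
    return {label: k for k, block in enumerate(blocks) for label in block}


def filter_for_student(
    dataset: Sequence[Tuple[Any, int]], router: Dict[int, int], student_id: int
) -> List[Tuple[Any, int]]:
    """Training side: keep only the (sample, label) pairs owned by one student."""
    return [(x, y) for x, y in dataset if router[y] == student_id]


def generate(
    label: int,
    noise: Any,
    students: Sequence[Callable[[Any, int], Any]],
    router: Dict[int, int],
) -> Any:
    """Inference side: retrieve one student and run a single forward pass."""
    return students[router[label]](noise, label)


# Hypothetical setup: 1000 classes shared among 4 students.
blocks = partition_labels(1000, 4)
router = build_router(blocks)
```

Because `generate` calls exactly one student per request, the per-sample cost is that of a single student regardless of how many students exist; the extra students cost storage and training compute rather than latency.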

Paper Structure

This paper contains 44 sections, 9 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Distillation into multiple students, where each student handles a subset of the input conditions. During training, each student is trained separately on data filtered to its subset. At inference, the single student corresponding to the input condition is retrieved for generation. We show samples in Fig. \ref{fig:SD_scratch_comparison}, with methodological details outlined in Fig. \ref{fig:three-stage_training}, and quantitative results in Tables \ref{tab:imagenet_short} and \ref{tab:sd_short}.
  • Figure 2: Samples of high guidance-scale text-to-image generation from the SD v1.5 teacher and different-sized students, with full training details in App. \ref{sec:implementation}. The same-sized student has comparable quality to the teacher. The smaller student, trained on a subset of dog-related data, achieves faster generation while retaining decent quality. The same-sized student is trained with the DM stage only, whereas the smaller student is trained with the TSM and DM stages (see Fig. \ref{fig:three-stage_training}). See additional samples in Fig. \ref{fig:SD_all}.
  • Figure 3: Three-stage training scheme in Eq. \ref{eqn:MSD-DMD-TSM}. Acronym meanings: TSM: teacher score matching (Eq. \ref{eqn:TSM} and Eq. \ref{eqn:MSD-DMD-TSM}); DM: distribution matching (Eq. \ref{eqn:MSD-DMD-TSM} and Sec. \ref{subsec:prelim_dm}); ADM: adversarial distribution matching (Eq. \ref{eqn:MSD-DMD-TSM} and Sec. \ref{subsec:prelim_adm}). Stage 1 and Stage 2 are techniques from previous works that help with same-sized students; Stage 0 is our contribution and is required for smaller students, which cannot be initialized with teacher weights.
  • Figure 4: A 2D toy model. From left to right: teacher (multi-step) generation; student one-step generation with $1$ and $8$ distilled students; and the $\ell_1$ distance between samples generated by the teacher and by the students. Takeaway: More students improve distillation quality on this easy-to-visualize setup.
  • Figure 5: Sample generations on ImageNet-64$\times$64 from the teacher and different-sized students, with architecture and latency details in App. \ref{sec:implementation}. The same-sized students have comparable or slightly better generation quality than the teacher. Smaller students achieve faster generation while retaining decent quality. Same-sized students are trained with the DM and ADM stages, whereas smaller students are trained with all three stages, as shown in Fig. \ref{fig:three-stage_training}.
  • ...and 7 more figures