Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models

Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk

TL;DR

This work addresses the inefficiency of large diffusion models by showing that distilled few-step students can outperform their teachers on a meaningful subset of samples. It introduces an adaptive teacher-student collaboration where a student generates an initial image and an inexpensive oracle gates whether the teacher should refine or regenerate the sample, yielding gains in both speed and quality. The approach is validated across text-to-image synthesis, editing, and controllable generation, with improvements in automated metrics and human preferences, and demonstrates practical benefits for editing and conditioning tasks. Overall, the method offers a practical path to faster, better text-conditioned diffusion synthesis by leveraging the complementary strengths of student and teacher models.

Abstract

Knowledge distillation methods have recently been shown to be a promising direction for speeding up the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed, the overall quality of student samples is typically lower than that of the teacher samples, which hinders their practical usage. In this work, we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the "approximate" nature of the student. Based on this finding, we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. Specifically, the distilled model produces the initial sample, and then an oracle decides whether it needs further improvement by the slow teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore, the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation.
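
The control flow described in the abstract (student first, oracle check, teacher only when needed) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `adaptive_generate`, the scalar oracle score, the `threshold` parameter, and the toy stand-in models are all assumptions introduced here purely for illustration.

```python
from typing import Callable, Tuple

def adaptive_generate(
    prompt: str,
    student: Callable[[str], str],
    oracle: Callable[[str, str], float],
    teacher: Callable[[str, str], str],
    threshold: float,
) -> Tuple[str, bool]:
    """Return (sample, used_teacher) for one prompt.

    Assumption: the oracle returns a scalar quality score and the draft
    is accepted when the score reaches a fixed threshold.
    """
    # Step 1: the fast few-step student produces the initial sample.
    draft = student(prompt)
    # Step 2: the oracle decides whether the draft is good enough.
    if oracle(prompt, draft) >= threshold:
        return draft, False
    # Step 3: otherwise the slow teacher improves on the draft.
    return teacher(prompt, draft), True

# Toy stand-ins (hypothetical) so the sketch runs without any model weights.
def toy_student(prompt: str) -> str:
    return "draft:" + prompt

def toy_oracle(prompt: str, sample: str) -> float:
    # Pretend that short prompts are handled well by the student.
    return 1.0 if len(prompt) < 10 else 0.0

def toy_teacher(prompt: str, draft: str) -> str:
    return "improved:" + prompt

if __name__ == "__main__":
    for p in ["a cat", "a very detailed city street at night"]:
        print(adaptive_generate(p, toy_student, toy_oracle, toy_teacher, 0.5))
```

Under these assumptions the teacher's cost is paid only for prompts whose draft the oracle rejects, so the average inference budget depends on how often the oracle accepts the student sample.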

Paper Structure (24 sections, 30 figures, 5 tables, 1 algorithm)

Figures (30)

  • Figure 1: Left: Overview of the proposed approach. Right: Side-by-side comparison of SDv1.5 and SDXL with their few-step distilled versions. The distilled models surpass the original ones in a noticeable number of samples for the same text prompts and initial noise.
  • Figure 2: Student outperforms its teacher (SD1.5). Left: Text-conditional image synthesis. Right: Text-guided image editing (SDEdit; Meng et al., 2021). The images within each pair are generated for the same initial noise sample.
  • Figure 3: (a) Visual examples of similar (Left) and dissimilar (Right) teacher and student samples. (b) Similarity between the student and teacher samples w.r.t. the difference in sample quality. Highly distinct samples tend to be of different quality. (c) Human vote distribution for different distance ranges between student and teacher samples. Most of the student wins are achieved when the student diverges from the teacher.
  • Figure 4: Effect of image complexity. (a) More similar student and teacher samples correspond to simpler images and vice versa. (b) The student and teacher largely diverge in image quality on the complex teacher samples.
  • Figure 5: Effect of text prompts. (a) Shorter prompts usually lead to more similar student and teacher samples. (b) The student and teacher tend to generate more similar images when the student relies heavily on the text prompt.
  • ...and 25 more figures