Scaling Down Text Encoders of Text-to-Image Diffusion Models

Lifu Wang, Daqing Liu, Xinchen Liu, Xiaodong He

TL;DR

The paper addresses the high computational cost of large text encoders in diffusion-based text-to-image synthesis by introducing vision-based knowledge distillation to compress T5-XXL into smaller encoders. Using a step-following distillation scheme guided by diffusion-model predictions and a carefully constructed three-dataset prompt corpus, the authors demonstrate that a distilled T5-Base can achieve image quality and semantic understanding near T5-XXL while being ~50x smaller, with substantial memory and latency gains. Text rendering remains the most size-sensitive attribute, but the approach maintains strong compatibility with auxiliary diffusion modules like ControlNet and LoRA. The work enables high-quality, accessible diffusion-based generation on consumer hardware and provides a scalable framework for distilling text representations in multimodal models.

Abstract

Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

Paper Structure

This paper contains 28 sections, 5 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Scaling down pattern of text encoders. We distilled T5-XXL into a series of smaller T5 models and evaluated their performance in guiding image synthesis across three key dimensions: image quality, semantic understanding, and text-rendering. Results are from Section \ref{sec:scaling}. We treat T5-XXL's performance as the baseline. Our findings indicate that while image quality and semantic understanding remain largely intact, text-rendering is more sensitive to reductions in model size.
  • Figure 2: Visual and non-visual embedding space illustration. T5 is trained on the C4 dataset, in which most data are non-visual natural language. If we use a non-visual prompt to generate an image, the image does not align well with the prompt, as shown by the low CLIP score. Therefore, we can use a smaller model to learn the useful visual embedding and discard redundant information.
  • Figure 3: Method overview. (a) We first illustrate the step-following distillation algorithm. Starting from standard Gaussian noise, we pass it, along with the teacher embedding and the student embedding, to a diffusion model to obtain two predictions. We use $L_{vision}$ to train the student encoder. After obtaining the teacher latents for the next timestep, we pass them to the diffusion model and repeat this process until $x_0$ is obtained. (b) During each step, we use an MLP to project the student embedding into the teacher's embedding space. We pass the student and teacher embeddings to a diffusion model together with $\mathbf{\hat{x}}_t$ and compute $L_{vision}$ based on the predictions of the student and teacher.
  • Figure 4: Model Size vs. Performance. We compare images generated by T5 encoders of different sizes in all three aspects. We use the same seed and a guidance scale of 3.5 for inference. Text-rendering ability is affected the most by model size among the three categories. Prompts: (1) "A graceful elf ... standing in an enchanted forest under the dappled sunlight ..."; (2) "A yellow apple and a green elephant"; (3) "A panda presenting a board that says 'hello world'".
  • Figure 5: Showcase of T5-Base performance. T5-Base can generate images with rich details and follow the prompt accurately. We put prompts in appendix for reference.
  • ...and 9 more figures
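
The step-following loop described in the Figure 3 caption can be sketched with a toy example. This is only an illustration under stated assumptions, not the authors' implementation: `toy_denoiser` is a hypothetical frozen stand-in for the diffusion model, `project` is a single linear layer standing in for the MLP, $L_{vision}$ is assumed to be a mean squared error between the two predictions, only the projection parameters are trained (the student embedding is held fixed), and the latent update is a simple Euler step driven by the teacher prediction.

```python
import math
import random

def toy_denoiser(x, t, emb, A):
    """Hypothetical frozen 'diffusion model': tanh(t * x_i + A[i] . emb)."""
    return [math.tanh(t * xi + sum(a * e for a, e in zip(row, emb)))
            for xi, row in zip(x, A)]

def project(student_emb, W, b):
    """Linear stand-in for the MLP mapping student -> teacher embedding space."""
    return [sum(w * s for w, s in zip(row, student_emb)) + bi
            for row, bi in zip(W, b)]

def step_following_distill(teacher_emb, student_emb, W, b, A,
                           num_steps=10, lr=0.1, seed=0):
    """Run one denoising trajectory, training W and b in place at every step."""
    rng = random.Random(seed)
    d = len(A)
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]  # start from Gaussian noise
    losses = []
    for k in range(num_steps):
        t = 1.0 - k / num_steps  # toy timestep schedule from 1 toward 0
        # Two predictions from the same latents: teacher and projected student.
        pred_t = toy_denoiser(x, t, teacher_emb, A)
        proj = project(student_emb, W, b)
        pred_s = toy_denoiser(x, t, proj, A)
        resid = [ps - pt for ps, pt in zip(pred_s, pred_t)]
        # Assumed form of L_vision: MSE between student and teacher predictions.
        losses.append(sum(r * r for r in resid) / d)
        # Analytic gradient of the MSE through tanh, back to the projection.
        g_pre = [2.0 * r * (1.0 - ps * ps) / d for r, ps in zip(resid, pred_s)]
        g_proj = [sum(A[i][j] * g_pre[i] for i in range(d))
                  for j in range(len(proj))]
        # Gradient-descent step on the projection parameters only.
        for j, gj in enumerate(g_proj):
            for m, s in enumerate(student_emb):
                W[j][m] -= lr * gj * s
            b[j] -= lr * gj
        # Step-following: the next latents come from the TEACHER prediction.
        x = [xi - pt / num_steps for xi, pt in zip(x, pred_t)]
    return losses, x

# Usage with made-up dimensions: 6-dim latents, 4-dim teacher, 2-dim student.
rng = random.Random(1)
A = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
teacher = [0.5, -0.2, 0.1, 0.9]
student = [0.3, -0.7]
W = [[0.0, 0.0] for _ in range(4)]
b = [0.0] * 4
losses, x0 = step_following_distill(teacher, student, W, b, A)
```

The point of the sketch is the control flow: the student is always supervised on latents produced by the teacher's own trajectory, so it sees the same inputs the teacher would see at inference time rather than latents corrupted by its own early mistakes.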