Scaling Down Text Encoders of Text-to-Image Diffusion Models

Lifu Wang; Daqing Liu; Xinchen Liu; Xiaodong He

TL;DR

The paper tackles the high computational cost of large text encoders in diffusion-based text-to-image synthesis by using vision-based knowledge distillation to compress T5-XXL into smaller encoders. With a step-following distillation scheme guided by the diffusion model's predictions and a prompt corpus built from three datasets, the authors show that a distilled T5-Base reaches image quality and semantic understanding close to T5-XXL while being about 50 times smaller, with substantial memory and latency savings. Text rendering is the attribute most sensitive to encoder size. The distilled encoders remain compatible with auxiliary diffusion modules such as ControlNet and LoRA. The work makes high-quality diffusion-based generation practical on consumer hardware and offers a scalable framework for distilling text representations in multimodal models.

Abstract

Text encoders in diffusion models have evolved rapidly, moving from CLIP to T5-XXL. Although this evolution has significantly enhanced the models' ability to understand complex prompts and generate text, it has also substantially increased the number of parameters. T5-series encoders are trained on the C4 natural language corpus, which includes a significant amount of non-visual data, yet diffusion models with a T5 encoder do not respond to such non-visual prompts, indicating redundancy in representational power. This raises an important question: "Do we really need such a large text encoder?" In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit the capabilities of T5-XXL, we construct our dataset based on three criteria: image quality, semantic understanding, and text rendering. Our results reveal a scaling-down pattern: the distilled T5-Base model can generate images of comparable quality to those produced by T5-XXL while being 50 times smaller. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.
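
To make the size claim concrete, the following is a rough back-of-envelope sketch of how a 50-fold parameter reduction translates into weight memory. The teacher parameter count (about 4.7 billion for the T5-XXL encoder) and the 16-bit weight format are assumptions chosen for illustration, not figures from the paper; only the 50x ratio comes from the abstract. The helper name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights alone, in GiB (activations excluded)."""
    return num_params * bytes_per_param / 2**30


# ASSUMPTION: approximate parameter count of the T5-XXL encoder (illustrative only).
TEACHER_PARAMS = 4.7e9
# From the abstract: the distilled encoder is 50 times smaller.
SIZE_RATIO = 50

student_params = TEACHER_PARAMS / SIZE_RATIO

# ASSUMPTION: weights stored in a 16-bit format (2 bytes per parameter).
teacher_gib = weight_memory_gib(TEACHER_PARAMS)
student_gib = weight_memory_gib(student_params)

print(f"teacher encoder weights: {teacher_gib:.2f} GiB")
print(f"student encoder weights: {student_gib:.2f} GiB")
```

Under these assumptions the text encoder's weights shrink from several GiB to a fraction of one GiB. The estimate covers weights only, so actual GPU usage also depends on the diffusion backbone, activations, and the precision used in practice.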
