TextCraftor: Your Text Encoder Can be Image Quality Controller

Yanyu Li, Xian Liu, Anil Kag, Ju Hu, Yerlan Idelbayev, Dhritiman Sagar, Yanzhi Wang, Sergey Tulyakov, Jian Ren

TL;DR

The findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, the encoder can be enhanced through the proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments.

Abstract

Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation, enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities, these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text, and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations, numerous studies have endeavored to fine-tune the pre-trained diffusion models, i.e., UNet, using various techniques. Yet, amidst these efforts, a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that, instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models, we can enhance it through our proposed fine-tuning approach, TextCraftor, leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly, our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet fine-tuning, and that the two can be combined to further improve generative quality.

Paper Structure

This paper contains 19 sections, 8 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Overview of TextCraftor, an end-to-end text encoder fine-tuning paradigm based on prompt data and reward functions. The text embedding is forwarded into the DDIM denoising chain to obtain the output image and compute the reward loss; we then backpropagate to update the parameters of the text encoder (and optionally the UNet) by maximizing the reward.
  • Figure 2: Qualitative visualizations. Left: generated images on Parti-Prompts, in the order of SDv1.5, prompt engineering, DDPO, and TextCraftor. Right: examples from HPSv2, ordered as SDv1.5, prompt engineering, and TextCraftor.
  • Figure 3: Interpolation between original text embedding (weight $0.0$) and the one from TextCraftor (weight $1.0$), demonstrating controllable generation. From top to bottom row: TextCraftor using HPSv2, PickScore, and Aesthetics as reward models.
  • Figure 4: Style mixing. Text encoders fine-tuned with different reward models can be combined to perform style mixing. The weights listed at the bottom are used for combining the text embeddings from {origin, Aesthetics, PickScore, HPSv2}, respectively.
  • Figure 5: Ablation on reward models and the effect of CLIP constraint. The leftmost column shows original images. Their averaged Aesthetics, PickScore, and HPSv2 scores are 5.49, 18.19, and 0.2672, respectively. For the following columns, we show the synthesized images without and with CLIP constraint using different reward models. The reward scores are listed at the bottom.
  • ...and 6 more figures
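
The interpolation and style mixing described in the captions of Figures 3 and 4 can be illustrated with a small sketch. It assumes the combination is a plain linear (weighted) sum of same-shaped text embeddings, which the captions suggest but do not state exactly. The names `interpolate` and `mix` and the toy 2x2 embeddings are hypothetical stand-ins, not part of TextCraftor.

```python
from typing import List, Sequence

# A text embedding as tokens x channels; a toy stand-in for real encoder output.
Embedding = List[List[float]]


def mix(embeddings: Sequence[Embedding], weights: Sequence[float]) -> Embedding:
    """Weighted sum of same-shaped embeddings (assumed rule; weights are not normalized)."""
    if len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    rows, cols = len(embeddings[0]), len(embeddings[0][0])
    for emb in embeddings:
        if len(emb) != rows or any(len(row) != cols for row in emb):
            raise ValueError("all embeddings must share the same shape")
    return [
        [sum(w * emb[i][j] for emb, w in zip(embeddings, weights)) for j in range(cols)]
        for i in range(rows)
    ]


def interpolate(original: Embedding, tuned: Embedding, weight: float) -> Embedding:
    """Blend two embeddings: weight 0.0 gives the original, 1.0 the fine-tuned one."""
    return mix([original, tuned], [1.0 - weight, weight])


# Hypothetical toy embeddings from an original and a fine-tuned text encoder.
original = [[0.0, 1.0], [2.0, 3.0]]
tuned = [[1.0, 1.0], [4.0, 1.0]]
halfway = interpolate(original, tuned, 0.5)
```

In a real pipeline the blended embedding would be passed to the denoising model in place of the original text embedding. Style mixing in the sense of Figure 4 corresponds to calling `mix` with one embedding per text encoder and the listed weights.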