VersaT2I: Improving Text-to-Image Models with Versatile Reward

Jianshu Guo, Wenhao Chai, Jie Deng, Hsiang-Wei Huang, Tian Ye, Yichen Xu, Jiawei Zhang, Jenq-Neng Hwang, Gaoang Wang

TL;DR

VersaT2I targets persistent shortcomings in text-to-image synthesis by decomposing image quality into four measurable aspects (aesthetics, text-image alignment, geometry, and low-level quality) and fine-tuning with LoRA on self-generated training data. Instead of reinforcement learning, it uses a per-aspect reward model to select training images, and it introduces Mixture of LoRA (MoL), which fuses the aspect-specific LoRAs through a gating mechanism with balancing constraints. The framework reports improvements across multiple quality criteria on SD v2.1 and SDXL, including human-preference metrics, while remaining model-agnostic and annotation-free. VersaT2I thus offers a way to improve T2I output quality without manual data labeling or RL-based optimization.

Abstract

Recent text-to-image (T2I) models have benefited from large-scale and high-quality data, demonstrating impressive performance. However, these T2I models still struggle to produce images that are aesthetically pleasing, geometrically accurate, faithful to text, and of good low-level quality. We present VersaT2I, a versatile training framework that uses multiple rewards to boost the performance of any T2I model. We decompose image quality into several aspects, such as aesthetics, text-image alignment, geometry, and low-level quality. Then, for each quality aspect, we select images generated by the model that score highly in that aspect and use them as the training set to fine-tune the T2I model with Low-Rank Adaptation (LoRA). Furthermore, we introduce a gating function to combine multiple quality aspects, which avoids conflicts between them. Our method is easy to extend and does not require any manual annotation, reinforcement learning, or model architecture changes. Extensive experiments demonstrate that VersaT2I outperforms the baseline methods across various quality criteria.
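
The data-selection step described in the abstract (generate candidates, score them with a reward model for one quality aspect, keep the best) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `build_training_set`, `toy_generate`, and `toy_reward` are stand-ins, the "image" is a small dictionary rather than pixels, and the number of candidates per prompt is an arbitrary choice. The LoRA fine-tuning itself is not shown.

```python
from typing import Callable, Dict, List, Tuple

Image = Dict[str, object]  # hypothetical stand-in for a generated image


def build_training_set(
    prompts: List[str],
    generate: Callable[[str, int], Image],
    reward: Callable[[str, Image], float],
    num_candidates: int = 4,  # assumed value, not from the paper
) -> List[Tuple[str, Image]]:
    """For each prompt, sample candidates and keep the highest-scoring one."""
    training_set = []
    for prompt in prompts:
        # One candidate per seed, all from the same (frozen) generator.
        candidates = [generate(prompt, seed) for seed in range(num_candidates)]
        # The reward model scores a single quality aspect, e.g. aesthetics.
        best = max(candidates, key=lambda image: reward(prompt, image))
        training_set.append((prompt, best))
    return training_set


def toy_generate(prompt: str, seed: int) -> Image:
    """Deterministic fake generator: 'quality' depends on prompt and seed."""
    quality = ((len(prompt) + 7 * seed) % 10) / 10.0
    return {"prompt": prompt, "seed": seed, "quality": quality}


def toy_reward(prompt: str, image: Image) -> float:
    """Fake reward model that simply reads the stored quality value."""
    return float(image["quality"])


dataset = build_training_set(["a red cube", "two cats"], toy_generate, toy_reward)
for prompt, image in dataset:
    print(prompt, "-> seed", image["seed"], "score", image["quality"])
```

In a real pipeline, the selected prompt-image pairs would form the fine-tuning set for one aspect-specific LoRA, and the procedure would be repeated once per reward model.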

Paper Structure (22 sections, 12 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: We present VersaT2I, a versatile training framework that uses multiple rewards to boost the performance of any T2I model. We apply different reward models for four aspects: aesthetics, text-image alignment, geometry, and low-level quality. The figure above shows results before and after fine-tuning with SDXL as the base model.
  • Figure 2: Overview of our proposed VersaT2I framework with two-stage training. First, given a prompt, a text-to-image generative model generates a batch of images as a candidate set, which is then evaluated by a pre-trained reward model. We then fine-tune the T2I model using LoRA, with the best-scoring samples from the candidate sets as the training set. After obtaining a LoRA for each reward model, we further compose the multiple LoRAs in a Mixture of LoRA layer design to achieve versatile improvement.
  • Figure 3: Qualitative results of the proposed VersaT2I improving aesthetics, text-image alignment, geometry, and low-level quality after training. All images have a resolution of $1024 \times 1024$ and are generated with the same noise and seed for fair comparison.
  • Figure 4: Qualitative ablation study of the Mixture-of-LoRA design. Our proposed method achieves better performance than naively averaging the LoRA weights from different reward models. All images have a resolution of $1024 \times 1024$ and are generated with the same noise and seed for fair comparison.
  • Figure 5: More examples generated by VersaT2I. Given the text prompt and a fixed seed, we generate images with SDXL (Podell et al., 2023) and VersaT2I. Images generated by VersaT2I exhibit high quality.
  • ...and 1 more figure
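
The contrast drawn in the Figure 2 and Figure 4 captions, a gated Mixture of LoRA layer versus naive averaging of LoRA weights, can be sketched for a single linear layer. This is an illustration under stated assumptions, not the paper's implementation: the softmax gate over a linear projection of the input, the function names, and the tiny rank-1 matrices are all hypothetical, and the balancing constraints mentioned in the TL;DR are not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lora_delta(down: Matrix, up: Matrix, x: Vector) -> Vector:
    """Low-rank update up @ (down @ x) for one aspect-specific LoRA."""
    return matvec(up, matvec(down, x))


def mixture_of_lora(
    x: Vector, base: Matrix, loras: List[Tuple[Matrix, Matrix]], gate: Matrix
) -> Tuple[Vector, Vector]:
    """Base output plus input-dependent weighted sum of LoRA updates.

    The gate (one row per LoRA) is an assumed design: softmax(gate @ x).
    """
    weights = softmax(matvec(gate, x))
    out = matvec(base, x)
    for weight, (down, up) in zip(weights, loras):
        delta = lora_delta(down, up, x)
        out = [o + weight * d for o, d in zip(out, delta)]
    return out, weights


def averaged_lora(x: Vector, base: Matrix, loras: List[Tuple[Matrix, Matrix]]) -> Vector:
    """Baseline: every LoRA update gets the same fixed weight 1/K."""
    out = matvec(base, x)
    for down, up in loras:
        delta = lora_delta(down, up, x)
        out = [o + d / len(loras) for o, d in zip(out, delta)]
    return out


x = [1.0, 2.0]
base = [[1.0, 0.0], [0.0, 1.0]]
loras = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # rank-1 LoRA acting on the first coordinate
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # rank-1 LoRA acting on the second coordinate
]
zero_gate = [[0.0, 0.0], [0.0, 0.0]]     # uniform weights, same as averaging
peaked_gate = [[10.0, 0.0], [0.0, 0.0]]  # strongly prefers the first LoRA

mixed_uniform, uniform_weights = mixture_of_lora(x, base, loras, zero_gate)
mixed_peaked, peaked_weights = mixture_of_lora(x, base, loras, peaked_gate)
averaged = averaged_lora(x, base, loras)
```

With a zero gate the mixture reduces to the averaging baseline, while a trained gate can shift weight toward whichever LoRA suits the current input, which is one way to read the claim that gating avoids conflicts between quality aspects.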