EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou

TL;DR

The paper tackles the challenge of reproducing state-of-the-art text-to-image generation without access to proprietary data or full model parameters. It introduces EvolveDirector, which uses a large vision-language model as a director to dynamically curate a high-value training dataset generated via public APIs from advanced models, and trains a diffusion-transformer base model (Edgen) online. Key findings show that as few as ~100K curated samples can match or exceed the performance obtained with ~10M generated samples, and that ensemble data from multiple advanced models can yield superior capabilities. This approach offers a cost-efficient, scalable path toward democratizing access to high-quality T2I generation, with code and weights released for downstream use.

Abstract

Recent advancements in generation models have showcased remarkable capabilities in generating fantastic content. However, most of them are trained on proprietary high-quality data, and some models withhold their parameters and only provide accessible application programming interfaces (APIs), limiting their benefits for downstream tasks. To explore the feasibility of training a text-to-image generation model comparable to advanced models using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model. Our experiments with extensive data indicate that a model trained on data generated by an advanced model can approximate its generation capability. However, this requires large-scale samples of 10 million or more, which incurs significant expenses in time, computational resources, and especially the costs associated with calling fee-based APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples generated by them to learn powerful and balanced abilities. The final trained model, Edgen, is demonstrated to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
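The dataset refinement loop described in the abstract can be illustrated with a minimal sketch. It is not the released implementation. It assumes one plausible reading of the four operations: discrimination compares the base model's image with the advanced model's image for the same prompt, deletion drops prompts the base model already handles, expansion adds variants of prompts it still fails on, and mutation adds fresh prompts for diversity. Every name here (`evolve_step`, `vlm_prefers_base`, `vlm_expand`, `vlm_mutate`) is hypothetical, images are stood in for by strings, and the training step between refinements is omitted.

```python
from typing import Callable, Dict, List


def evolve_step(
    pairs: Dict[str, str],
    base_generate: Callable[[str], str],
    advanced_generate: Callable[[str], str],
    vlm_prefers_base: Callable[[str, str, str], bool],
    vlm_expand: Callable[[str, int], List[str]],
    vlm_mutate: Callable[[], List[str]],
    n_variants: int = 2,
) -> Dict[str, str]:
    """One refinement pass over a prompt -> target-image dataset (hypothetical)."""
    updated = dict(pairs)
    for prompt, target in pairs.items():
        candidate = base_generate(prompt)
        # Discrimination: the VLM judges the base model's image against the target.
        if vlm_prefers_base(prompt, candidate, target):
            # Deletion: the base model has caught up on this prompt (assumed rule).
            del updated[prompt]
        else:
            # Expansion: request variants of the prompt the base model fails on.
            for variant in vlm_expand(prompt, n_variants):
                if variant not in updated:
                    updated[variant] = advanced_generate(variant)
    # Mutation: add unrelated fresh prompts to keep the dataset diverse.
    for fresh in vlm_mutate():
        if fresh not in updated:
            updated[fresh] = advanced_generate(fresh)
    return updated


# Toy stand-ins: the "VLM" treats "a cat" as already learned by the base model.
learned = {"a cat"}
demo = evolve_step(
    pairs={"a cat": "img:a cat", "a dog": "img:a dog"},
    base_generate=lambda p: f"base:{p}",
    advanced_generate=lambda p: f"img:{p}",
    vlm_prefers_base=lambda p, cand, tgt: p in learned,
    vlm_expand=lambda p, n: [f"{p}, variant {i}" for i in range(n)],
    vlm_mutate=lambda: ["a red cube"],
)
print(sorted(demo))
```

In this toy run the learned prompt is dropped, the failing prompt is kept and gains two variants, and one mutated prompt is added. In a real setup the calls to `advanced_generate` would be the fee-based API requests, so skipping prompts already in the dataset avoids paying for them twice.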

Paper Structure

This paper contains 18 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Images generated by our model Edgen (EvolveDirector-Gen). Edgen can generate high-quality images with multiple ratios and resolutions. Notably, it excels in generating text and avoiding attribute confusion when generating multiple objects, which are significant characteristics of the most advanced text-to-image models available today. The input text prompts are annotated under the corresponding images.
  • Figure 2: Overview of the proposed framework EvolveDirector. (a) Advanced T2I models provide accessible APIs, allowing users to input text prompts and get the generated images. (b) The base model is trained on the dynamic dataset, consisting of text prompts and corresponding images generated by advanced models via API calls. The VLM continuously evaluates the base model and, according to its performance, dynamically updates and refines the dataset through discrimination, expansion, deletion, and mutation operations.
  • Figure 3: An example of the interaction between the EvolveDirector, VLM, and advanced T2I model. For brevity, auxiliary instructions to the VLM are omitted in this figure.
  • Figure 4: Human evaluation of the images generated by the base model, Edgen trained by the proposed EvolveDirector, and multiple advanced models.
  • Figure 5: Images generated by the base model, Edgen trained by our EvolveDirector, and multiple advanced models. The three rows showcase the generation of humans, text, and multiple objects.
  • ...and 5 more figures