Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

Chutian Meng, Fan Ma, Jiaxu Miao, Chi Zhang, Yi Yang, Yueting Zhuang

TL;DR

The Image Regeneration task is introduced to assess text-to-image (T2I) models by having the T2I model generate an image that matches a reference image. Experiments demonstrate that a robust T2I model produces images that more closely resemble the reference image.

Abstract

Diffusion models have revitalized the image generation domain, playing crucial roles in both academic research and artistic expression. With the emergence of new diffusion models, assessing the performance of text-to-image (T2I) models has become increasingly important. Current metrics focus on directly matching the input text with the generated image, but due to cross-modal information asymmetry, this leads to unreliable or incomplete assessment results. Motivated by this, we introduce the Image Regeneration task to assess T2I models by having the T2I model generate an image that matches a reference image. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model, allowing T2I models to understand image content. This simplifies evaluation, because comparing the generated image with the reference image is straightforward. Two regeneration datasets, one content-diverse and one style-diverse, are introduced to evaluate the leading diffusion models currently available. Additionally, we present the ImageRepainter framework, which enhances the quality of generated images by improving content comprehension through iterative generation and revision guided by a multimodal large language model (MLLM). Our comprehensive experiments showcase the effectiveness of this framework in assessing the generative capabilities of models. By leveraging an MLLM, we demonstrate that a robust T2I model can produce images that more closely resemble the reference image.

Paper Structure

This paper contains 13 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Architecture comparison among (a) pre-trained model evaluation, (b) QG&QA (Question Generation & Question Answering) evaluation, and (c) our image regeneration evaluation, where our approach (c) achieves better alignment with human cognition.
  • Figure 2: Overview of ImageRepainter. The framework consists of two stages: image understanding and iterative generation. These stages are displayed from left to right and interact continuously with the LLM. The detailed process of each stage is shown above: (1) Image understanding: the image information is first organized into a tree-like structure called the image understanding tree (IUT), and initial prompts are then generated from the information in the IUT. (2) Iterative generation: the second stage comprises four parts: prompt generation, image generation, image selection, and feedback generation.
  • Figure 3: Examples of images generated using prompts from CLIP-interrogator and from our proposed IUT. The prompts generated by CLIP-interrogator describe the image inaccurately and incompletely, which leads to unsatisfactory results.
  • Figure 4: An example of IUT construction, showing that the IUT captures more information, such as color and facial details of the image, than the direct caption.
  • Figure 5: The distribution of the style-diverse benchmark and the content-diverse benchmark.
  • ...and 4 more figures
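
The two stages named in the Figure 2 caption (image understanding, then a loop of prompt generation, image generation, image selection, and feedback generation) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `regenerate` and its parameters `describe`, `generate`, `score`, and `revise` are hypothetical names, and the toy stand-ins treat an "image" as a set of attribute words so that the loop runs without GPT4V or a diffusion model.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

# Toy stand-ins (assumptions): an "image" is the set of attributes it shows,
# and a prompt is a list of attribute words.
Image = FrozenSet[str]
Prompt = List[str]


def regenerate(
    reference: Image,
    describe: Callable[[Image], Prompt],            # hypothetical MLLM captioner
    generate: Callable[[Prompt], Image],            # hypothetical T2I model
    score: Callable[[Image, Image], float],         # hypothetical similarity metric
    revise: Callable[[Prompt, Image, Image], Prompt],  # hypothetical MLLM feedback step
    rounds: int = 3,
    candidates: int = 1,
) -> Tuple[Optional[Image], float]:
    """Return the best generated image and its similarity to the reference."""
    # Stage 1: image understanding produces the initial prompt.
    prompt = describe(reference)
    best_image: Optional[Image] = None
    best_score = float("-inf")
    # Stage 2: iterative generation and revision.
    for _ in range(rounds):
        images = [generate(prompt) for _ in range(candidates)]
        # Image selection: keep the candidate closest to the reference.
        top = max(images, key=lambda img: score(reference, img))
        top_score = score(reference, top)
        if top_score > best_score:
            best_image, best_score = top, top_score
        # Feedback generation: revise the prompt using the selected image.
        prompt = revise(prompt, reference, top)
    return best_image, best_score


# Deterministic toy components, for illustration only.
reference: Image = frozenset({"cat", "red", "sofa", "window"})


def describe(img: Image) -> Prompt:
    return sorted(img)[:2]  # an incomplete first description


def generate(prompt: Prompt) -> Image:
    return frozenset(prompt)  # "draws" exactly what the prompt mentions


def score(ref: Image, img: Image) -> float:
    return len(ref & img) / len(ref | img)  # Jaccard similarity


def revise(prompt: Prompt, ref: Image, img: Image) -> Prompt:
    missing = sorted(ref - img)
    return prompt + missing[:1]  # feedback adds one missing attribute


best_image, best_score = regenerate(reference, describe, generate, score, revise)
print(best_score)
```

In a real setting each callable would wrap a model call, and the similarity score between the regenerated image and the reference would serve as the evaluation signal for the T2I model; the paper's own prompts, metrics, and selection rules are not reproduced here.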