Visual question answering based evaluation metrics for text-to-image generation

Mizuki Miyamoto, Ryugo Morita, Jinjia Zhou

TL;DR

Experimental results show that the proposed evaluation approach is a superior metric: it simultaneously assesses fine-grained text-image alignment and image quality, and the weighting between the two can be adjusted.

Abstract

Text-to-image generation and text-guided image manipulation have received considerable attention in the field of image generation. However, mainstream evaluation methods for these tasks focus on the overall alignment between the input text and the generated images, and they have difficulty evaluating whether all the information in the input text is accurately reflected in those images. This paper proposes new evaluation metrics that assess the alignment between the input text and the generated images for every individual object. First, ChatGPT is used to produce questions about the generated images from the input text. We then use Visual Question Answering (VQA) to measure the relevance of the generated images to the input text, which allows for a more detailed evaluation of alignment than existing methods. In addition, we use Non-Reference Image Quality Assessment (NR-IQA) so that the metric evaluates not only the text-image alignment but also the quality of the generated images. Experimental results show that our proposed evaluation approach is a superior metric that can simultaneously assess fine-grained text-image alignment and image quality, while allowing the weighting between the two to be adjusted.
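The scoring step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the rule that the alignment score is the fraction of VQA answers equal to "yes", the function names `tia_score` and `combined_score`, and the `weight` parameter are all assumptions made for the example. The VQA and NR-IQA models themselves are replaced by hard-coded placeholder outputs.

```python
from typing import Sequence


def tia_score(answers: Sequence[str], expected: str = "yes") -> float:
    """Fraction of VQA answers matching the expected answer.

    Assumed scoring rule: each generated question expects `expected`
    when the image faithfully reflects the input text.
    """
    if not answers:
        raise ValueError("at least one VQA answer is required")
    hits = sum(1 for a in answers if a.strip().lower() == expected)
    return hits / len(answers)


def combined_score(tia: float, iqa: float, weight: float = 0.5) -> float:
    """Weighted sum of alignment (tia) and quality (iqa), both in [0, 1].

    `weight` is a hypothetical knob: 1.0 uses alignment only,
    0.0 uses image quality only, 0.5 weights them equally.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * tia + (1.0 - weight) * iqa


# Placeholder VQA answers to questions generated from the input text.
answers = ["yes", "Yes", "no", "yes"]
tia = tia_score(answers)

# Placeholder NR-IQA quality score, assumed to be normalised to [0, 1].
final = combined_score(tia, iqa=0.6, weight=0.5)
```

Keeping the two scores separate until the last step is what makes the ratio adjustable: the same VQA and NR-IQA outputs can be recombined with a different `weight` without rerunning either model.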

Paper Structure

This paper contains 11 sections, 1 equation, 3 figures.

Figures (3)

  • Figure 1: An overview of our method. ChatGPT [ouyang2022training] is used to generate questions for VQA from the input text. VQA and NR-IQA are employed to evaluate the Text-Image Alignment (TIA) and the image quality of the generated images. The TIA scoring process quantifies the outputs of VQA. The final score is produced by adjusting the weighting between these two scores.
  • Figure 2: Comparison of text-image alignment scores with CLIPScore [hessel2021clipscore]. For each of two images, distinct texts are provided: one aligns with the actual content of the image, while the other contains expressions inconsistent with it. Bold words in the input text are those that do not match the image content. GT(TIA) indicates the ground-truth rank of the Text-Image Alignment.
  • Figure 3: Left: Comparison of text-image alignment scores with CLIPScore [hessel2021clipscore] and ImageReward [xu2023imagereward]. Distinct texts are provided for each of three images. Bold words in the input text are those that do not match the image content. GT(TIA) indicates the ground-truth rank of the Text-Image Alignment. Right: Comparison of text-image alignment and image quality scores with CLIPScore [hessel2021clipscore], ImageReward [xu2023imagereward], and MANIQA [yang2022maniqa]. Three images are degraded using different JPEG compression rates. GT(IQA) represents the ground-truth rank of the Image Quality Assessment. The Ours(TIA + IQA) score combines the TIA score and the IQA score with equal weighting.
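The figure captions compare each metric's scores against a ground-truth rank (GT). One way such a comparison could be checked is sketched below. The helper names, the convention that rank 1 is the best score, and the example scores are assumptions for illustration only; none of them are taken from the paper.

```python
from typing import List, Sequence


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Convert scores to ranks, where rank 1 is the highest score.

    Ties are broken by original position (an arbitrary choice made here).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def matches_ground_truth(scores: Sequence[float], gt_ranks: Sequence[int]) -> bool:
    """True when the ranking induced by `scores` equals the ground-truth ranking."""
    return ranks_from_scores(scores) == list(gt_ranks)


# Hypothetical metric scores for three text-image pairs and their GT ranks.
metric_scores = [0.9, 0.4, 0.7]
gt = [1, 3, 2]
agrees = matches_ground_truth(metric_scores, gt)
```

A metric that ranks the pairs in the same order as the ground truth is behaving as intended for that example, regardless of the absolute score values.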