Image2Text2Image: A Novel Framework for Label-Free Evaluation of Image-to-Text Generation with Text-to-Image Diffusion Models

Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Evangelos Kanoulas

TL;DR

A novel evaluation framework called Image2Text2Image is proposed, which leverages diffusion models, such as Stable Diffusion or DALL-E, for text-to-image generation and does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models.

Abstract

Evaluating the quality of automatically generated image descriptions is a complex task that requires metrics capturing various dimensions, such as grammaticality, coverage, accuracy, and truthfulness. Although human evaluation provides valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr attempt to fill this gap, but they often exhibit weak correlations with human judgment. To address this challenge, we propose a novel evaluation framework called Image2Text2Image, which leverages diffusion models, such as Stable Diffusion or DALL-E, for text-to-image generation. In the Image2Text2Image framework, an input image is first processed by the image captioning model under evaluation to generate a textual description. Using this generated description, a diffusion model then creates a new image. Features are extracted from the original and generated images and compared using a designated similarity metric. A high similarity score suggests that the model has produced a faithful textual description, while a low score highlights discrepancies, revealing potential weaknesses in the model's performance. Notably, our framework does not rely on human-annotated reference captions, making it a valuable tool for assessing image captioning models. Extensive experiments and human evaluations validate the efficacy of our proposed Image2Text2Image evaluation framework. The code and dataset will be published to support further research in the community.
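The evaluation loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `image2text2image_score` and the three callables (`caption_model`, `text_to_image`, `image_encoder`) are hypothetical placeholders for whatever captioning model, diffusion model, and feature extractor are plugged in, and cosine similarity is assumed as the similarity metric because the human evaluation figure reports cosine similarity scores.

```python
from typing import Any, Callable, Sequence

import numpy as np


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; returns 0.0 if either is all zeros."""
    va = np.asarray(a, dtype=float).ravel()
    vb = np.asarray(b, dtype=float).ravel()
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    if denom == 0.0:
        return 0.0
    return float(np.dot(va, vb) / denom)


def image2text2image_score(
    image: Any,
    caption_model: Callable[[Any], str],             # hypothetical: image -> caption
    text_to_image: Callable[[str], Any],             # hypothetical: caption -> image
    image_encoder: Callable[[Any], Sequence[float]], # hypothetical: image -> features
) -> float:
    """Score a captioning model on one image without any reference caption."""
    # Step 1: the captioning model under evaluation describes the input image.
    caption = caption_model(image)
    # Step 2: a text-to-image model regenerates an image from that caption.
    regenerated = text_to_image(caption)
    # Step 3: both images are encoded with the same feature extractor.
    original_features = image_encoder(image)
    regenerated_features = image_encoder(regenerated)
    # Step 4: the similarity of the two feature vectors is the score.
    return cosine_similarity(original_features, regenerated_features)
```

Passing the models in as callables keeps the sketch independent of any particular library. In practice each callable would wrap a real captioning model, diffusion pipeline, and image encoder.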

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Flowchart of the proposed evaluation framework. The proposed framework consists of four main components: an image captioning module, an image encoder, a text-to-image generation model (Stable Diffusion), and a similarity calculator. The image captioning module employs a chosen model to process an input image and generate textual descriptions. The image encoder is tasked with extracting features from the input image. The text-to-image generation model utilizes the text descriptions produced by the image captioning model to generate the corresponding image. Finally, the similarity calculator computes the similarity between the features of the input image and the image generated by the text-to-image generation model.
  • Figure 2: Human evaluation results. The top three lines represent scenarios where the provided caption aligns with the correct human-annotated description, while the bottom three lines represent scenarios where the caption is incorrect. "Gap 1", "Gap 2", and "Gap 3" signify the disparities in average cosine similarity scores.
  • Figure 3: Correlations with human judgment for Flickr8K-Expert.
  • Figure 4: Correlation with human judgment for Flickr8K-CF, a version of the Flickr8K dataset annotated through crowdsourcing.
  • Figure 5: Pairwise accuracy results on FOIL hallucination detection. The baseline models use either one or four references.
  • ...and 2 more figures
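
Two of the quantities named in the captions above can be illustrated with a short sketch. Both definitions are assumptions made for illustration, since the captions do not spell them out: the "gap" is taken to be the difference between the mean similarity score for correct captions and the mean for incorrect ones, and pairwise accuracy is taken to be the fraction of pairs in which the correct caption scores strictly higher than its hallucinated counterpart. The function names `similarity_gap` and `pairwise_accuracy` are hypothetical.

```python
from statistics import mean
from typing import Sequence


def similarity_gap(
    correct_scores: Sequence[float], incorrect_scores: Sequence[float]
) -> float:
    """Assumed definition: mean score of correct captions minus mean score of incorrect ones."""
    return mean(correct_scores) - mean(incorrect_scores)


def pairwise_accuracy(
    correct_scores: Sequence[float], foil_scores: Sequence[float]
) -> float:
    """Assumed definition: share of pairs where the correct caption outscores its foil.

    Ties count as failures, because the metric did not separate the two captions.
    """
    if len(correct_scores) != len(foil_scores):
        raise ValueError("score lists must have the same length")
    if not correct_scores:
        raise ValueError("at least one pair is required")
    wins = sum(c > f for c, f in zip(correct_scores, foil_scores))
    return wins / len(correct_scores)
```

A larger gap or a higher pairwise accuracy would indicate that the metric separates faithful captions from incorrect ones more reliably.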