A Novel Evaluation Framework for Image2Text Generation

Jia-Hong Huang, Hongyi Zhu, Yixian Shen, Stevan Rudinac, Alessio M. Pacces, Evangelos Kanoulas

TL;DR

The paper tackles the problem of evaluating automatically generated image captions without relying on costly human references, noting the weak alignment of traditional metrics with human judgments. It proposes a novel, reference-free evaluation framework in which an image captioning model first generates a caption, a large language model then renders an image from that caption, and image features from the original and generated images are compared using cosine similarity. The approach leverages modern LLMs and image encoders to produce a visual fidelity signal that reflects caption quality, validated against human judgments on MSCOCO, Flickr30k, and an augmented dataset. Findings indicate the framework's similarity scores align with human consensus and provide a scalable, complementary tool to traditional metrics for image captioning evaluation.

Abstract

Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often show weak correlations with human judgment. We address this challenge by introducing a novel evaluation framework rooted in a modern large language model (LLM), such as GPT-4 or Gemini, capable of image generation. In our proposed framework, we begin by feeding an input image into a designated image captioning model, chosen for evaluation, to generate a textual description. Using this description, an LLM then creates a new image. By extracting features from both the original and LLM-created images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the image captioning model has accurately generated textual descriptions, while a low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance. Human-annotated reference captions are not required in our proposed evaluation framework, which serves as a valuable tool for evaluating the effectiveness of image captioning models. Its efficacy is confirmed through human evaluation.
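The pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `captioner`, `generator`, and `extractor` are hypothetical callables standing in for the image captioning model under evaluation, the image-generating LLM, and the image feature extractor. Cosine similarity is used as the similarity metric, as the TL;DR states.

```python
import math
from typing import Any, Callable, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def score_caption_model(
    image: Any,
    captioner: Callable[[Any], str],                # hypothetical: model under evaluation
    generator: Callable[[str], Any],                # hypothetical: LLM that renders an image from text
    extractor: Callable[[Any], Sequence[float]],    # hypothetical: image feature extractor
) -> float:
    """Caption the image, regenerate an image from the caption, compare features."""
    caption = captioner(image)
    regenerated = generator(caption)
    return cosine_similarity(extractor(image), extractor(regenerated))
```

No reference caption appears anywhere in the function signature, which is the point of the framework: the score depends only on the input image and the images derived from it.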

Paper Structure (16 sections, 2 equations, 6 figures)


Figures (6)

  • Figure 1: Flowchart for image captioning. Existing image captioning architectures can be broadly categorized into two groups: those based on the recurrent neural network (RNN) and those based on the transformer architecture. To aid comprehension, we represent RNN-based methods with blue paths and transformer-based approaches with red paths. The process involves feeding an input image through an image encoder for feature extraction, followed by a language generator to produce text-based descriptions using the extracted image features.
  • Figure 2: Flowchart of the proposed evaluation framework. The proposed framework consists of four main components: an image captioning module, an image feature extractor, a large language model (LLM), and a similarity calculator. The image captioning module employs a chosen model to process an input image and generate textual descriptions. The image feature extractor is tasked with extracting features from the input image. The LLM utilizes the text descriptions produced by the image captioning model to generate the corresponding image. Finally, the similarity calculator computes the similarity between the features of the input image and the image generated by the LLM.
  • Figure 3: Dataset examples. To provide a clearer insight into the introduced human-annotated dataset, we have randomly selected four examples for illustrative purposes. Each image in the dataset is accompanied by five human-annotated descriptions that vividly depict the content of the image.
  • Figure 4: Human evaluation results. The outcomes are derived from three datasets: MSCOCO (highlighted in red), Flickr30k (highlighted in green), and our dataset (highlighted in blue). The top three lines represent scenarios where the provided caption aligns with the correct human-annotated description, while the bottom three lines represent scenarios where the caption is incorrect. "Gap 1", "Gap 2", and "Gap 3" signify the disparities in average cosine similarity scores. We observe that these gaps are approximately $0.2$, indicating the influence of the provided captions on the cosine similarity scores. A larger gap indicates a substantial mismatch between the human-annotated image description and the provided or model-generated caption, whereas a smaller gap suggests a higher degree of alignment.
  • Figure 5: Impact of incorrect image captions. Due to LLMs' proficiency in generating images accurately from provided text prompts, inconsistencies between model-generated image captions and human-annotated ground truth descriptions can lead to discrepancies in the generated images. Leveraging this observation, we propose an evaluation framework to assess the performance of image captioning models without using human-annotated ground truth captions.
  • ...and 1 more figure
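
The "gaps" described in the Figure 4 caption are differences between average cosine similarity scores for correct and incorrect captions. A minimal sketch of that arithmetic follows. The function name `similarity_gap` and the scores in the usage note are hypothetical and are not taken from the paper's results.

```python
from statistics import mean
from typing import Sequence


def similarity_gap(
    correct_scores: Sequence[float],
    incorrect_scores: Sequence[float],
) -> float:
    """Average similarity with correct captions minus average with incorrect ones.

    Both inputs are per-image cosine similarity scores from one dataset.
    """
    if not correct_scores or not incorrect_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(correct_scores) - mean(incorrect_scores)
```

For example, with made-up scores `[0.8, 0.6]` for correct captions and `[0.5, 0.5]` for incorrect ones, the gap is about 0.2, the same order of magnitude the caption reports for the three datasets.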