VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models

Jingtao Cao, Zheng Zhang, Hongru Wang, Kam-Fai Wong

TL;DR

VLEU quantitatively measures a model’s generalizability by computing the Kullback-Leibler (KL) divergence between the visual text marginal distribution and the conditional distribution over the images generated by the model.

Abstract

Progress in Text-to-Image (T2I) models has significantly improved the generation of images from textual descriptions. However, existing evaluation metrics do not adequately assess the models' ability to handle a diverse range of textual prompts, which is crucial for their generalizability. To address this, we introduce a new metric called Visual Language Evaluation Understudy (VLEU). VLEU uses large language models to sample from the visual text domain, the set of all possible input texts for T2I models, to generate a wide variety of prompts. The images generated from these prompts are evaluated based on their alignment with the input text using the CLIP model. VLEU quantifies a model's generalizability by computing the Kullback-Leibler divergence between the marginal distribution of the visual text and the conditional distribution of the images generated by the model. This metric provides a quantitative way to compare different T2I models and track improvements during model finetuning. Our experiments demonstrate the effectiveness of VLEU in evaluating the generalization capability of various T2I models, positioning it as an essential metric for future research in text-to-image synthesis.
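
The KL-based computation described in the abstract can be illustrated with a minimal sketch. This is one plausible reading, not necessarily the authors' exact formulation: it assumes a precomputed matrix of CLIP image-text similarities (row i holds the similarities between the image generated from prompt i and every sampled prompt), turns each row into a conditional distribution over prompts with a softmax, averages the rows to get the marginal, and exponentiates the mean KL divergence. The function names, the `temperature` parameter, the direction of the KL divergence, and the final exponentiation are all assumptions made for illustration.

```python
import math


def softmax(row, temperature):
    """Turn one row of similarities into a probability distribution."""
    scaled = [value / temperature for value in row]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def vleu_like_score(similarity, temperature=0.01):
    """Hypothetical VLEU-style score from an image-by-prompt similarity matrix.

    similarity[i][j] is assumed to be the CLIP similarity between the image
    generated from prompt i and the text of prompt j.
    """
    # Conditional distribution over prompts for each generated image.
    cond = [softmax(row, temperature) for row in similarity]
    n_images = len(cond)
    n_prompts = len(cond[0])

    # Marginal distribution over prompts: the average of the conditionals.
    marginal = [
        sum(row[j] for row in cond) / n_images for j in range(n_prompts)
    ]

    # Mean KL(conditional || marginal) across images (assumed direction).
    total_kl = 0.0
    for row in cond:
        for p, q in zip(row, marginal):
            if p > 0.0:  # 0 * log(0) is treated as 0
                total_kl += p * math.log(p / q)

    # Exponentiation is an assumption, borrowed from Inception-Score-style metrics.
    return math.exp(total_kl / n_images)
```

Under these assumptions, a model whose images all resemble the same prompt yields identical conditional distributions and a score of 1, while a model whose images each match only their own prompt yields a score approaching the number of prompts.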

Paper Structure

This paper contains 29 sections, 5 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The loss of generalization of a T2I model. When fine-tuning a T2I model with images of a brown and white dog, as the fine-tuning process advances, prompts for dogs of various colors start to yield outputs that increasingly reflect the characteristics of the dog present in the training dataset. This results in generated images that deviate from the original textual description, indicating a clear case of overfitting and a loss of generalization.
  • Figure 2: Generalizability in T2I Models: A Comparative Visualization. The first model demonstrates strong generalization by successfully producing images that align with various input prompts. In contrast, the second model shows overfitting to the prompt "A white and yellow dog," leading to a failure to generalize to other inputs. As a result, its generated images are generally misaligned with the given prompts. Our proposed VLEU metric aims to quantify this observation.
  • Figure 3: Variation of different metrics during finetuning. In this example, we finetuned SD 1.5 on 5 specific teddy bear images and sampled 25 prompts from the visual text domain covered by various teddy bears using ChatGPT 3.5. For FID, we treated the images in the dataset as the real image distribution.
  • Figure 4: Changes in VLEU of a T2I model during finetuning. Throughout the finetuning of SD 1.5 on five particular teddy bear images, images are generated every 20 steps using the same prompts, and the VLEU score is calculated at the same steps. The VLEU score gradually decreases as the model begins to overfit and lose generalization, and its values align well with this trend.
  • Figure 5: VLEU of different finetuning methods. We finetuned SD 1.5 on 5 specific teddy bears. For DreamBooth, we used 25 teddy bear images generated by the initial model as class images. During the evaluation of VLEU, we used 25 prompts about teddy bears sampled by ChatGPT 3.5.
  • ...and 2 more figures