SelfEval: Leveraging the discriminative nature of generative models for evaluation

Sai Saketh Rambhatla, Ishan Misra

TL;DR

SelfEval is, to the authors' knowledge, the first automated metric for text-faithfulness to show a high degree of agreement with gold-standard human evaluations across multiple generative models, benchmarks, and evaluation metrics. The authors hope it enables easy and reliable automated evaluation of diffusion models.

Abstract

We present an automated way to evaluate the text alignment of text-to-image generative diffusion models using standard image-text recognition datasets. Our method, called SelfEval, uses the generative model to compute the likelihood of real images given text prompts; this likelihood can then be used to perform recognition tasks with the generative model. We evaluate generative models on standard datasets created for multimodal text-image discriminative learning and assess fine-grained aspects of their performance: attribute binding, color recognition, counting, shape recognition, and spatial understanding. Existing automated metrics rely on an external pretrained model, such as a vision-language model (VLM) like CLIP or an LLM, and are sensitive to the exact pretrained model and its limitations. SelfEval sidesteps these issues and, to the best of our knowledge, is the first automated metric for text-faithfulness to show a high degree of agreement with gold-standard human evaluations across multiple generative models, benchmarks, and evaluation metrics. SelfEval also reveals that generative models achieve recognition performance competitive with discriminative models on challenging tasks such as the Winoground image-score. We hope SelfEval enables easy and reliable automated evaluation for diffusion models.
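
As a hedged illustration of how a likelihood can be turned into a recognition decision, Bayes' rule gives a posterior over candidate captions $\mathbf{c}_i$ for a real image $\mathbf{x}_0$. The uniform prior over captions used for the simplification on the right is an assumption made here for clarity, not a detail stated in the abstract.

$$
p(\mathbf{c}_i \mid \mathbf{x}_0) = \frac{p(\mathbf{x}_0 \mid \mathbf{c}_i)\, p(\mathbf{c}_i)}{\sum_{j} p(\mathbf{x}_0 \mid \mathbf{c}_j)\, p(\mathbf{c}_j)}, \qquad \hat{\imath} = \arg\max_{i}\; p(\mathbf{x}_0 \mid \mathbf{c}_i) \quad \text{(uniform prior)}
$$

Under that assumption the denominator and the prior are the same for every caption, so the predicted caption is simply the one under which the generative model assigns the image the highest likelihood.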

Paper Structure (20 sections, 7 equations, 11 figures, 8 tables, 1 algorithm)

This paper contains 20 sections, 7 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of proposed method: (Left) Starting from a noised input, the standard diffusion sampling method denoises the input iteratively to generate images from the input distribution. (Middle): SelfEval takes a pair (image $\mathbf{x}_0$ and conditioning $\mathbf{c}$) to estimate the likelihood $p(\mathbf{x}_0 | \mathbf{c})$ of the pair in an iterative fashion. (Right): Given an image, $\mathbf{x}_0$ and $n$ captions, $\{\mathbf{c}_0, \mathbf{c}_1, \dotsc, \mathbf{c}_n\}$, SelfEval is a principled way to convert generative models into discriminative models. We show that the classification performance of these classifiers can be used to evaluate the generative capabilities.
  • Figure 2: Drawbacks of CLIP for generative model evaluation. (Left) We compare the CLIP similarity scores of two latent diffusion models (Rombach et al., 2022) trained with CLIP ViT-L/14 (LDM-CLIP (ViT-L/14)) and OpenCLIP ViT-H/14 (LDM-CLIP (ViT-H/14)) text encoders, on prompts generated from the DrawBench, Winoground, and COCO datasets. The left plot shows scores computed using CLIP ViT-L/14; the right plot shows scores computed using OpenCLIP ViT-H/14. The ranking changes depending on the model used. (Right) CLIP performs poorly on tasks involving counting instances, spatial relationships, matching attributes to objects, and understanding corrupted text, which constitute about 50 (25%) of the prompts in DrawBench. In each example, the correct caption is shown in green and the caption CLIP picked is shown in bold. Using CLIP to evaluate text-to-image models on such prompts is not optimal.
  • Figure 2: Template for human raters. The template consists of instructions explaining the nature of the task (top) followed by a text prompt with two generations (bottom). Raters pick one of four options (shown on the right): "both" if both generations are faithful, "none" if neither is, or "Image 1" or "Image 2" if only that image is faithful to the text prompt.
  • Figure 3: Representative samples from the benchmark. We divide the evaluation into six broad tasks, namely Attribute binding, Color, Count, Shape, Spatial, and Text Corruption. Each task is designed to evaluate a specific aspect of text faithfulness, mimicking the categories in DrawBench. Each task is posed as an image-text matching problem where, given an image, the goal is to pick the right caption among distractors. The figure above shows examples from each task with the right caption highlighted in green.
  • ...and 6 more figures
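
The image-text matching setup described in the Figure 1 and Figure 3 captions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-caption log-likelihood scores $\log p(\mathbf{x}_0 \mid \mathbf{c}_i)$ have already been estimated by some diffusion model, and it assumes a uniform prior over captions. The function names (`caption_posterior`, `pick_caption`, `matching_accuracy`) and the example scores are hypothetical.

```python
import math


def caption_posterior(log_likelihoods):
    """Turn per-caption log-likelihoods into posterior probabilities.

    Assumes a uniform prior over captions, so the posterior is a softmax
    over the log-likelihoods. The maximum is subtracted for numerical
    stability before exponentiating.
    """
    top = max(log_likelihoods)
    weights = [math.exp(score - top) for score in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


def pick_caption(log_likelihoods):
    """Return the index of the caption with the highest log-likelihood."""
    return max(range(len(log_likelihoods)), key=lambda i: log_likelihoods[i])


def matching_accuracy(score_rows, correct_indices):
    """Fraction of images whose highest-scoring caption is the correct one.

    score_rows[k][i] is the (assumed precomputed) log-likelihood of image k
    under candidate caption i; correct_indices[k] is the true caption index.
    """
    hits = sum(
        pick_caption(row) == label
        for row, label in zip(score_rows, correct_indices)
    )
    return hits / len(correct_indices)


# Made-up placeholder scores for two images with three candidate captions each.
example_scores = [
    [-120.5, -118.2, -125.0],
    [-98.1, -101.4, -99.7],
]
example_labels = [1, 0]
example_accuracy = matching_accuracy(example_scores, example_labels)
```

The accuracy computed this way is a property of the generative model alone, which is the sense in which no external scoring model such as CLIP is needed.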