Attribute Based Interpretable Evaluation Metrics for Generative Models

Dongkyun Kim, Mingi Kwon, Youngjung Uh

TL;DR

The paper tackles the interpretability gap in evaluating generative models by introducing an attribute-focused evaluation framework. It defines Heterogeneous CLIPScore (HCS) to quantify image–text attribute strengths and introduces two metrics, Single-attribute Divergence (SaD) and Paired-attribute Divergence (PaD), which measure how generated data diverges from training data in terms of attribute distributions and attribute relationships. Through KDE-based PDF estimation and KL divergence, SaD and PaD reveal which attributes and attribute pairs are misrepresented, and they align with human judgments while exposing model-specific weaknesses not captured by traditional metrics like FID. The approach is validated across several models and tasks (e.g., FFHQ, LSUN Cat, COCO-based text-to-image), demonstrating improved interpretability and actionable insights for model development and fair evaluation in generative modeling. Overall, the framework lays the groundwork for explainable, attribute-aware evaluation that can guide model selection and bias analysis in practice, with considerations for KDE sample size, VLM quality, and attribute extraction biases.

Abstract

When the training dataset comprises a 1:1 proportion of dogs to cats, a generative model that produces 1:1 dogs and cats better resembles the training species distribution than another model with 3:1 dogs and cats. Can we capture this phenomenon using existing metrics? Unfortunately, we cannot, because these metrics do not provide any interpretability beyond "diversity". In this context, we propose a new evaluation protocol that measures the divergence of a set of generated images from the training set regarding the distribution of attribute strengths as follows. Single-attribute Divergence (SaD) measures the divergence regarding PDFs of a single attribute. Paired-attribute Divergence (PaD) measures the divergence regarding joint PDFs of a pair of attributes. Together, they reveal which attributes the models struggle with. For measuring the attribute strengths of an image, we propose Heterogeneous CLIPScore (HCS), which measures the cosine similarity between image and text vectors with heterogeneous initial points. With SaD and PaD, we reveal the following about existing generative models. ProjectedGAN generates implausible attribute relationships, such as a baby with a beard, even though it has competitive scores on existing metrics. Diffusion models struggle to capture diverse colors in the datasets. With more sampling timesteps, the latent diffusion model generates more minor objects such as earrings and necklaces. Stable Diffusion v1.5 captures the attributes better than v2.1. Our metrics lay a foundation for explainable evaluations of generative models.
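
To make the SaD description concrete, the following is a minimal sketch of a per-attribute divergence built from KDE-estimated PDFs and KL divergence, as the TL;DR describes. It is not the authors' implementation. The function names, the Scott's-rule bandwidth, the Riemann-sum integration, the KL direction (training relative to generated), and the averaging over attributes are all assumptions made for illustration. The inputs are assumed to be arrays of attribute strengths (for example HCS values) with one column per attribute.

```python
import numpy as np

def gaussian_kde_pdf(samples, grid, bandwidth=None):
    """Evaluate a 1-D Gaussian KDE of `samples` at every point of `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    if bandwidth is None:
        # Scott's rule of thumb: an assumption of this sketch.
        bandwidth = samples.std(ddof=1) * n ** (-1.0 / 5.0)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel_sum = np.exp(-0.5 * z ** 2).sum(axis=1)
    return kernel_sum / (n * bandwidth * np.sqrt(2.0 * np.pi))

def kl_divergence_1d(train_scores, gen_scores, num_points=512, eps=1e-12):
    """Approximate KL(p_train || p_gen) for one attribute with a Riemann sum."""
    lo = min(np.min(train_scores), np.min(gen_scores))
    hi = max(np.max(train_scores), np.max(gen_scores))
    pad = 0.1 * (hi - lo)
    grid = np.linspace(lo - pad, hi + pad, num_points)
    # eps keeps the logarithm finite where a density underflows to zero.
    p = gaussian_kde_pdf(train_scores, grid) + eps
    q = gaussian_kde_pdf(gen_scores, grid) + eps
    dx = grid[1] - grid[0]
    return float(np.sum(p * np.log(p / q)) * dx)

def single_attribute_divergence(train_strengths, gen_strengths):
    """Return (mean KL over attributes, per-attribute KL values).

    Both inputs have shape (num_images, num_attributes).
    """
    per_attr = np.array([
        kl_divergence_1d(train_strengths[:, j], gen_strengths[:, j])
        for j in range(train_strengths.shape[1])
    ])
    return float(per_attr.mean()), per_attr

# Toy data: attribute 0 matches the training set, attribute 1 is shifted.
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))
generated = train.copy()
generated[:, 1] += 2.0
sad, per_attribute = single_attribute_divergence(train, generated)
print("SaD-like score:", sad)
print("per-attribute KL:", per_attribute)
```

The per-attribute values are what make this kind of score interpretable: in the toy data only the shifted attribute contributes to the divergence, which mirrors how the abstract describes identifying the attributes a model struggles with. A PaD-style variant would apply the same idea to joint 2-D densities of attribute pairs.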

Paper Structure (52 sections, 7 equations, 16 figures, 24 tables)

This paper contains 52 sections, 7 equations, 16 figures, 24 tables.

Figures (16)

  • Figure 1: Conceptual illustration of our metric. We design a scenario in which Model 2 lacks diversity. (a) Although existing metrics (gray box) capture the inferiority of Model 2, they do not explain the judgment. (b) Our proposed attribute-based metric (green box) is interpretable: Model 2 is biased regarding long hair, makeup, smiling, and beard.
  • Figure 2: Illustration of CLIPScore and Heterogeneous CLIPScore. We visualize the CLIP embedding space obtained from multiple texts. The yellow ellipse represents the embedding space of CelebA’s text attributes, while the blue ellipse visualizes the embedding space of images. (a) CLIPScore (CS) evaluates the similarity between $V^{CS}_{img}$ and $V^{CS}_{Text}$ from the coordinate origin, where the angle between the two vectors is bounded, resulting in a limited range of similarity values. (b) Heterogeneous CLIPScore (HCS) gauges the similarity between $V^{HCS}_{img}$ and $V^{HCS}_{Text}$ using the defined means of images $C_\mathcal{X}$ and texts $C_\mathcal{A}$ as the origins, so the range of similarity is unrestricted. (c) HCS takes more flexible values than CS.
  • Figure 3: Validation of metrics through biased data injection. We design one set of 30K typical FFHQ images and another set of 30K FFHQ images plus injected images. Biased data injection, illustrated in (a) with makeup and (b) with bangs, leads to a rise in both SaD and PaD. In contrast, unbiased data injection, (c) person and (d) real data, which injects the same distribution as the training set, results in no rise in SaD or PaD. Our metrics effectively capture changes in attribute distribution, while existing metrics cannot.
  • Figure 4: Failure cases by ProjectedGAN. ProjectedGAN disregards attribute relationships, such as generating babies with beards.
  • Figure 5: LDM with 50 timesteps vs. LDM with 200 timesteps. With increased sampling timesteps, (a) the SaD of LDM gets worse (b) because it generates too many fine objects such as earrings or necklaces.
  • ...and 11 more figures
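
The contrast between CS and HCS described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: it uses random toy vectors in place of real CLIP embeddings, the function names are hypothetical, and any scaling constant applied to the cosine similarity is omitted. The centers stand in for the means of images $C_\mathcal{X}$ and texts $C_\mathcal{A}$ named in the caption; here they are computed from the toy arrays themselves.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of `a` and rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def clip_score(img_emb, txt_emb):
    """CS-style similarity: vectors are measured from the coordinate origin."""
    return cosine_matrix(img_emb, txt_emb)

def heterogeneous_clip_score(img_emb, txt_emb, img_center=None, txt_center=None):
    """HCS-style similarity: images and texts are each measured from their own center."""
    if img_center is None:
        img_center = img_emb.mean(axis=0)  # stand-in for the image mean C_X
    if txt_center is None:
        txt_center = txt_emb.mean(axis=0)  # stand-in for the text mean C_A
    return cosine_matrix(img_emb - img_center, txt_emb - txt_center)

# Toy embeddings (not real CLIP outputs): each modality sits in a tight
# cluster far from the origin, so angles measured from the origin vary little.
rng = np.random.default_rng(0)
dim = 16
img_offset = 10.0 * np.ones(dim)
txt_offset = 10.0 * np.concatenate([np.ones(dim // 2), -np.ones(dim // 2)])
img_emb = img_offset + rng.normal(size=(200, dim))
txt_emb = txt_offset + rng.normal(size=(5, dim))

cs = clip_score(img_emb, txt_emb)
hcs = heterogeneous_clip_score(img_emb, txt_emb)
print("CS  min/max:", cs.min(), cs.max())
print("HCS min/max:", hcs.min(), hcs.max())
```

In this toy setting the CS values fall in a narrow band while the HCS values spread over a much wider range, which is the behavior panel (c) of Figure 2 describes. When comparing a training set with a generated set, the same centers would presumably be reused for both sets so that the scores stay comparable; that choice is an assumption of this sketch.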