Disentangling Mean Embeddings for Better Diagnostics of Image Generators

Sebastian G. Gruber, Pascal Tobias Ziegler, Florian Buettner

TL;DR

The paper tackles the difficulty of evaluating image generators with region-specific diagnostics by introducing a disentanglement of mean embeddings into cluster-wise components using central kernel alignment. It proves that, under a partition with vanishing cross-cluster CKAs, the image-wide cosine mean similarity (CMS) decomposes into a product of cluster-wise CMS terms, enabling localized performance assessment. Practically, the authors identify pixel clusters via pairwise CKA and hierarchical clustering, then monitor cluster-wise CMS during training to pinpoint regions where generators struggle. Experiments on CelebA and ChestMNIST with DCGAN and DDPM architectures illustrate that cluster-level analysis can reveal misbehavior in specific image regions, offering more actionable diagnostics than standard metrics like MMD or FID.

Abstract

The evaluation of image generators remains a challenge due to the limitations of traditional metrics in providing nuanced insights into specific image regions. This is a critical problem as not all regions of an image may be learned with similar ease. In this work, we propose a novel approach to disentangle the cosine similarity of mean embeddings into the product of cosine similarities for individual pixel clusters via central kernel alignment. Consequently, we can quantify the contribution of the cluster-wise performance to the overall image generation performance. We demonstrate how this enhances the explainability and the likelihood of identifying pixel regions of model misbehavior across various real-world use cases.
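The product decomposition described in the abstract can be illustrated with a minimal sketch. It assumes a Gaussian RBF kernel (which factorizes over pixels), a plug-in estimate of the mean-embedding inner products, and a hypothetical `clusters` mapping from cluster name to pixel indices. None of these choices are taken from the paper, and all function names are illustrative.

```python
import numpy as np


def rbf_gram(a, b, sigma=1.0):
    """Gaussian RBF Gram matrix between the rows of a (n, p) and b (m, p)."""
    sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))


def cms(x, y, sigma=1.0):
    """Plug-in estimate of the cosine similarity of two kernel mean embeddings.

    Assumption: <mu_X, mu_Y> is estimated by the mean of the Gram matrix
    between real samples x and generated samples y (rows are flattened images).
    """
    kxy = rbf_gram(x, y, sigma).mean()
    kxx = rbf_gram(x, x, sigma).mean()
    kyy = rbf_gram(y, y, sigma).mean()
    return kxy / np.sqrt(kxx * kyy)


def clusterwise_cms(x, y, clusters, sigma=1.0):
    """CMS restricted to each pixel cluster.

    `clusters` is a hypothetical dict: cluster name -> list of pixel indices.
    """
    return {name: cms(x[:, idx], y[:, idx], sigma) for name, idx in clusters.items()}
```

When the pixel clusters are independent of each other, the product of the values returned by `clusterwise_cms` approximates `cms(x, y)` on the full images. A cluster whose value drops during training then points to the image region where the generator struggles.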

Paper Structure

This paper contains 12 sections, 2 theorems, 20 equations, 6 figures, 1 algorithm.

Key Result

Theorem 1

Assume for random variables $X = \left( X_1, \dots, X_d \right)^\intercal$ and $Y = \left( Y_1, \dots, Y_d \right)^\intercal$ with outcomes in a space $\mathcal{X}^d$ and for a p.s.d. kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, there exists a partition $\mathbf{I}$ of the indice
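
The statement above is cut off in this listing. Going only by the TL;DR, and not by the paper's exact wording, the quantities involved can be written as follows. Here $\mu_X$ denotes the kernel mean embedding of $X$ and $X_I$ the sub-vector of $X$ indexed by a cluster $I \in \mathbf{I}$; both symbols are notation assumed for this sketch.

$$
\operatorname{CMS}(X, Y) = \frac{\langle \mu_X, \mu_Y \rangle}{\lVert \mu_X \rVert \, \lVert \mu_Y \rVert},
\qquad
\operatorname{CMS}(X, Y) = \prod_{I \in \mathbf{I}} \operatorname{CMS}(X_I, Y_I).
$$

The first identity is the definition of the cosine similarity of mean embeddings. The second is the decomposition that the TL;DR attributes to the theorem, holding when the $\operatorname{CKA}$ between any two distinct clusters vanishes.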

Figures (6)

  • Figure 1: Samples of the CelebA dataset. Most faces are centered, of similar size, and at similar angles. The clusters identified in Figure 2 match this observation.
  • Figure 2: Top-Left: The identified clusters match how a human may separate the image structure of CelebA: There are two clusters for the background (Clusters 4 & 5), two clusters for long hair or alternating head angles (Clusters 1 & 2), and one central cluster for the head and neck (Cluster 3). Top-Right: The correlation matrix in terms of the $\operatorname{CKA}$ values indicates how well the clusters can be separated. The blocks on the diagonal are ordered by cluster number. As can be seen, most clusters are fairly independent of the other clusters (especially Cluster 3). Only Clusters 4 and 5 show a relatively strong dependence on each other, which is expected since these often express the same background in the images (cf. Figure 1). Bottom: Unlike the other errors, we can decompose the image-wise $\operatorname{CMS}$ into the $\operatorname{CMS}$ of different clusters according to the $\operatorname{CKA}$. This offers novel insights into model performance. For example, we can detect that Clusters 3 and 5 degrade more during the training collapses than the other clusters.
  • Figure 3: Different errors throughout the training of a DCGAN model on the CelebA dataset. All lines are an average of 20 seeds. Left: The $\operatorname{CMS}$ (higher is better) shows how the average training run improves until epoch 10. After epoch 30, some models collapse, and after epoch 45 additional collapses occur. Computing the product of the cluster-wise $\operatorname{CMS}$ values according to our methodology shows a close match with the normal $\operatorname{CMS}$, indicating the correctness of the clusters. Right: The ISC does match the other errors but shows erratic behavior. The KID and FID resemble the $\operatorname{CMS}$ quite closely. The $\operatorname{MMD}$ also shows a trajectory similar to the other errors but indicates a minimum around epochs 5-7. Generated samples of the training runs match these observations (cf. Figure 5).
  • Figure 4: Top-Left: The identified clusters for ChestMNIST match how a human may separate the image structure: There are three clusters for the lung area, one for the abdomen, and two for the upper chest and background. Top-Right: The correlation matrix in terms of the $\operatorname{CKA}$ values indicates how well the clusters can be separated. The blocks on the diagonal are ordered by cluster number. As can be seen, most clusters are fairly independent. Cluster 6 could be further separated. Mid: Comparing the cluster-wise $\operatorname{CMS}$ values throughout training of DCGAN and DDPM architectures shows the difficulty of learning each cluster. The DCGAN architectures show a performance drop around epoch 4 that is mostly due to Clusters 5 and 6. Lower: The cluster-wise $\operatorname{CMS}$ values successfully represent the image-wise $\operatorname{CMS}$.
  • Figure 5: Generated samples of the twenty training runs (each row is a seed, each column a sample of the respective seed). Initially, all models are improving their fit. At 21 epochs, no further improvements are visible. At 41 epochs, the training of two models collapsed. At 49 epochs, the training of two additional models collapsed. The collapses are visible in all evaluation metrics in Figure 3, but only with our approach can we quantify the extent to which the individual pixel regions are affected.
  • ...and 1 more figure
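
The cluster maps and $\operatorname{CKA}$ matrices in Figures 2 and 4 come from pairwise CKA between pixels followed by hierarchical clustering, per the TL;DR. A minimal sketch of such a pipeline is given below. It assumes the standard centered kernel alignment estimator with a Gaussian RBF kernel and a hand-rolled average-linkage merge; the paper's exact CKA definition, kernel, and linkage rule are not given in this listing, so all of these are assumptions, and the function names are illustrative.

```python
import numpy as np


def centered_gram(v, sigma=1.0):
    """Centered RBF Gram matrix for one pixel's values v of shape (n,)."""
    k = np.exp(-((v[:, None] - v[None, :]) ** 2) / (2.0 * sigma**2))
    h = np.eye(len(v)) - 1.0 / len(v)  # centering matrix H = I - 11^T / n
    return h @ k @ h


def pairwise_cka(x, sigma=1.0):
    """CKA between every pair of pixels; x has shape (n_samples, n_pixels)."""
    grams = [centered_gram(x[:, j], sigma) for j in range(x.shape[1])]
    d = len(grams)
    cka = np.eye(d)
    for i in range(d):
        for j in range(i + 1, d):
            norm = np.linalg.norm(grams[i]) * np.linalg.norm(grams[j])
            # Constant pixels have a zero centered Gram matrix; treat them as unaligned.
            value = (grams[i] * grams[j]).sum() / norm if norm > 0 else 0.0
            cka[i, j] = cka[j, i] = value
    return cka


def average_linkage_clusters(sim, n_clusters):
    """Agglomerative clustering: repeatedly merge the two most similar clusters.

    Similarity between clusters is the mean pairwise CKA (average linkage).
    """
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = sim[np.ix_(clusters[a], clusters[b])].mean()
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair  # b > a, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

The quadratic loops are only practical for a small number of pixels. For full images, `scipy.cluster.hierarchy.linkage` and `fcluster` applied to `1 - cka` as a distance would be the more usual choice.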

Theorems & Definitions (3)

  • Theorem 1
  • proof
  • Corollary 1