All Roads Lead to Rome? Exploring Representational Similarities Between Latent Spaces of Generative Image Models

Charumathi Badrinath, Usha Bhalla, Alex Oesterling, Suraj Srinivas, Himabindu Lakkaraju

TL;DR

The paper measures how similar the latent spaces of VAEs, GANs, Normalizing Flows (NFs), and Diffusion Models are by stitching encoders and decoders together with linear maps between frozen latent spaces. It evaluates the stitched models with output-based (reconstruction) and probe-based metrics on CelebA. Linear maps between performant models preserve most visual information even when latent sizes differ, gender is the most similarly represented probe-able attribute, and, for an NF, latent representations converge early in training. These results suggest a shared latent structure across model families, which could support cross-model editing and transfer.

Abstract

Do different generative image models secretly learn similar underlying representations? We investigate this by measuring the latent space similarity of four different models: VAEs, GANs, Normalizing Flows (NFs), and Diffusion Models (DMs). Our methodology involves training linear maps between frozen latent spaces to "stitch" arbitrary pairs of encoders and decoders and measuring output-based and probe-based metrics on the resulting "stitched" models. Our main findings are that linear maps between latent spaces of performant models preserve most visual information even when latent sizes differ; for CelebA models, gender is the most similarly represented probe-able attribute. Finally, we show on an NF that latent space representations converge early in training.
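
The stitching idea in the abstract can be illustrated with a minimal sketch. It assumes an affine map (weights plus bias) fitted in closed form by ordinary least squares; the paper says only that linear maps are trained between frozen latent spaces, so the fitting procedure here is a stand-in, not the authors' method. The names `fit_linear_stitch` and `latent_mse` are hypothetical, and the latents are synthetic placeholders for flattened encoder outputs.

```python
import numpy as np

def fit_linear_stitch(z_src, z_tgt):
    """Fit z_tgt ~= z_src @ W + b by ordinary least squares.

    z_src: (n, d_src) latents from the source encoder (kept frozen).
    z_tgt: (n, d_tgt) latents of the same images from the target encoder.
    The two latent sizes may differ. Closed-form least squares is an
    assumption of this sketch; gradient-based training would also work.
    """
    ones = np.ones((z_src.shape[0], 1))
    design = np.hstack([z_src, ones])          # extra column models the bias
    coef, *_ = np.linalg.lstsq(design, z_tgt, rcond=None)
    return coef[:-1], coef[-1]                 # W: (d_src, d_tgt), b: (d_tgt,)

def latent_mse(z_pred, z_true):
    """Mean squared error between mapped and native target latents."""
    return float(np.mean((z_pred - z_true) ** 2))

# Synthetic stand-in data: the target latents are a noisy affine function
# of the source latents, so a linear stitch should recover them well.
rng = np.random.default_rng(0)
n, d_src, d_tgt = 500, 32, 16
z_src = rng.normal(size=(n, d_src))
true_W = rng.normal(size=(d_src, d_tgt))
z_tgt = z_src @ true_W + 0.5 + 0.01 * rng.normal(size=(n, d_tgt))

# Fit on 400 paired latents, evaluate latent-space MSE on the remaining 100.
W, b = fit_linear_stitch(z_src[:400], z_tgt[:400])
held_out_mse = latent_mse(z_src[400:] @ W + b, z_tgt[400:])
print(f"held-out latent MSE: {held_out_mse:.6f}")
```

In a real stitched model, the mapped latents `z_src @ W + b` would be reshaped and passed to the frozen target decoder, and the decoded images compared against the originals with pixel-space and perceptual metrics.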

Paper Structure (20 sections, 13 figures, 2 tables)

Figures (13)

  • Figure 1: Reconstruction of a CelebA image (left) using stitched models. Latent space MSEs of mapped latents are displayed above each image. The stitched models yielding the best reconstructions use the VQVAE and NF encoders.
  • Figure 2: Reconstruction of a CelebA-Synthetic image using various encoders stitched to the GAN decoder. The leftmost image is the ground truth. Stitched models using the VQVAE and NF encoders yield the closest reconstructions.
  • Figure 3: Heatmaps of latent-space MSE, pixel-space RMSE, LPIPS and FID for images reconstructed by stitched models. We see relatively low values of these metrics for models using the NF and VQVAE encoders, corroborating the results of Figure \ref{fig:celeba-mapped}.
  • Figure 4: The top 4 rows show latent space linear probe accuracy on various binary attributes. The NF and VQVAE have the most linearly separable latent spaces. The remaining rows show the change in accuracy of a probe trained on model $\mathbb{X}$'s latent space between predictions on latents encoded by $\mathbb{X}$ and predictions on latents mapped from $\mathbb{X}' \rightarrow \mathbb{X}$. Probe accuracy increases on latents mapped from a more linearly separable to a less linearly separable latent space.
  • Figure 5: Percentage of times a probe trained on model $\mathbb{X}$'s latent space produces the same prediction on latents encoded by $\mathbb{X}$ and latents mapped from $\mathbb{X}' \rightarrow \mathbb{X}$. Attributes correlated with gender are represented similarly by nearly every model.
  • ...and 8 more figures
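
The probe-based quantities described in the captions of Figures 4 and 5 can be sketched as follows. This is an illustration under assumptions, not the paper's code: the probe is taken to be a binary linear classifier with a zero decision threshold, the accuracy change is taken as mapped minus native, and the function names and toy latents are hypothetical.

```python
import numpy as np

def probe_predict(latents, weights, bias):
    """Binary predictions of a linear probe (assumed form: sign of w.z + b)."""
    return (latents @ weights + bias > 0).astype(int)

def probe_agreement(z_native, z_mapped, weights, bias):
    """Percentage of samples where the probe gives the same prediction on
    latents encoded by model X and on latents mapped from X' into X."""
    same = probe_predict(z_native, weights, bias) == probe_predict(z_mapped, weights, bias)
    return 100.0 * float(np.mean(same))

def probe_accuracy_change(z_native, z_mapped, labels, weights, bias):
    """Accuracy on mapped latents minus accuracy on native latents.
    The sign convention is an assumption of this sketch."""
    acc_native = np.mean(probe_predict(z_native, weights, bias) == labels)
    acc_mapped = np.mean(probe_predict(z_mapped, weights, bias) == labels)
    return float(acc_mapped - acc_native)

# Toy 2-D latents: the probe reads only the first coordinate.
weights, bias = np.array([1.0, 0.0]), 0.0
labels = np.array([1, 0, 1, 0])
z_native = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-2.0, 1.0]])
z_mapped = np.array([[0.5, 0.0], [-0.5, 0.0], [-0.1, 1.0], [-2.0, 1.0]])

agreement = probe_agreement(z_native, z_mapped, weights, bias)
change = probe_accuracy_change(z_native, z_mapped, labels, weights, bias)
print(agreement, change)
```

Here the mapped latent for the third sample crosses the decision boundary, so the probe agrees on three of four samples and loses one correct prediction.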