An Analysis of Human Alignment of Latent Diffusion Models

Lorenz Linhardt, Marco Morik, Sidney Bender, Naima Elosegui Borras

TL;DR

The representational alignment of latent diffusion models with humans is comparable to that of models trained only on ImageNet-1k. Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stages of generation.

Abstract

Diffusion models, trained on large amounts of data, have shown remarkable performance in image synthesis. They have high error consistency with humans and low texture bias when used for classification. Furthermore, prior work demonstrated the decomposability of their bottleneck layer representations into semantic directions. In this work, we analyze how well such representations are aligned with human responses on a triplet odd-one-out task. We find that, despite the aforementioned observations: I) The representational alignment of these models with humans is comparable to that of models trained only on ImageNet-1k. II) The most aligned layers of the denoiser U-Net are intermediate layers and not the bottleneck. III) Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stages of generation.

Paper Structure

This paper contains 20 sections, 2 equations, and 10 figures.

Figures (10)

  • Figure 1: We assess the alignment of image representations obtained from different layers of the U-Net with the human representation space via the triplet odd-one-out task. In this task, three images are presented, and participants identify which image is the least similar to the others. This human judgment is then compared to the model's choice of the odd-one-out based on the cosine similarity of representations.
  • Figure 2: Left: Comparison of the odd-one-out accuracy (OOOA) of the best layer of the diffusion model with the models analysed by Muttenthaler et al. (2023) ($\dagger$). Middle/Right: OOOA per layer and noise level for SD2.1 without and with text conditioning, respectively. The alignment of SD2.1 is highest at the second up-sampling block (i.e. 'Up 1'). It is within the lower range of OOOAs observed for models trained on ImageNet-1k. After probing, SD2.1 is more aligned than unimodal self-supervised models or classifiers. Also, label-conditioning (Cond) improves alignment, especially at high noise levels.
  • Figure 3: Per-concept R$^2$-scores for the regression of VICE dimensions from SD2.1 representations, measured at different U-Net blocks for a noise level of 20%. Colors tend to be decodable at shallower layers, whereas most other concepts peak at the second up-sampling block.
  • Figure 4: Top: The decoded latents for different noise levels. Bottom: The images ${\bm{x}}$ reconstructed from the noisy latents via a single forward step by SD2.1.
  • Figure 5: Odd-one-out accuracy for zero-shot representations without text conditioning. Intermediate up-sampling layers are most aligned with human similarity judgments.
  • ...and 5 more figures
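
The odd-one-out evaluation described in the caption of Figure 1 can be illustrated with a minimal sketch. It assumes the model's odd-one-out is the image left out of the most similar pair under cosine similarity, and that odd-one-out accuracy is the fraction of triplets on which this choice matches the human choice. The function names and the toy 2-D vectors are hypothetical and stand in for real U-Net representations; this is not the authors' code.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def odd_one_out(reps: Sequence[Vector]) -> int:
    """Index (0, 1 or 2) of the item excluded from the most similar pair.

    Assumption: the most similar pair is found via cosine similarity.
    """
    # Key = index of the item left out of the pair whose similarity is stored.
    pair_similarity = {
        2: cosine_similarity(reps[0], reps[1]),
        1: cosine_similarity(reps[0], reps[2]),
        0: cosine_similarity(reps[1], reps[2]),
    }
    return max(pair_similarity, key=pair_similarity.get)


def odd_one_out_accuracy(triplets, human_choices) -> float:
    """Fraction of triplets where the model's choice equals the human choice."""
    hits = sum(odd_one_out(t) == h for t, h in zip(triplets, human_choices))
    return hits / len(human_choices)


# Toy data (hypothetical): each triplet holds three image representations.
triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # items 0 and 1 are close
    ([1.0, 0.0], [0.0, 1.0], [0.1, 0.9]),  # items 1 and 2 are close
]
human_choices = [2, 1]  # hypothetical human odd-one-out indices

print(odd_one_out_accuracy(triplets, human_choices))
```

In this toy case the model agrees with the human choice on the first triplet only. In practice, the vectors would be flattened or pooled activations from one U-Net block at a given noise level.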