An Analysis of Human Alignment of Latent Diffusion Models

Lorenz Linhardt; Marco Morik; Sidney Bender; Naima Elosegui Borras

TL;DR

The representational alignment of latent diffusion models with humans is comparable to that of models trained only on ImageNet-1k. Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stage of generation.

Abstract

Diffusion models, trained on large amounts of data, have shown remarkable performance in image synthesis. They have high error consistency with humans and low texture bias when used for classification. Furthermore, prior work demonstrated the decomposability of their bottleneck layer representations into semantic directions. In this work, we analyze how well such representations are aligned with human responses on a triplet odd-one-out task. We find that, despite the aforementioned observations: I) The representational alignment with humans is comparable to that of models trained only on ImageNet-1k. II) The most aligned layers of the denoiser U-Net are intermediate layers and not the bottleneck. III) Text conditioning greatly improves alignment at high noise levels, hinting at the importance of abstract textual information, especially in the early stage of generation.

Paper Structure (20 sections, 2 equations, 10 figures)

Introduction
Contributions
Method
Representation Extraction
Representational Alignment With Humans
Alignment by Affine Probing
Experiments
How Well Aligned are the Representations of Diffusion Models?
Can the Representation be Aligned Easily?
How Does Alignment Vary Across Layers?
Do Layers Encode Different Concepts?
What is the Impact of Text-Conditioning on Alignment?
Conclusion
Related Work
Visualization of Noise Levels
...and 5 more sections

Figures (10)

Figure 1: We assess the alignment of image representations obtained from different layers of the U-Net with the human representation space via the triplet odd-one-out task. In this task, three images are presented, and participants identify which image is the least similar to the others. This human judgment is then compared to the model's choice of the odd-one-out based on the cosine similarity of representations.
Figure 2: Left: Comparison of the odd-one-out accuracy (OOOA) of the best layer of the diffusion model with that of the models analysed by Muttenthaler et al. (2023) ($\dagger$). Middle/Right: OOOA per layer and noise level for SD2.1 without and with text conditioning, respectively. The alignment of SD2.1 is highest at the second up-sampling block (i.e. 'Up 1'). It is within the lower range of OOOAs observed for models trained on ImageNet-1k. After probing, SD2.1 is more aligned than unimodal self-supervised models or classifiers. Also, label-conditioning (Cond) improves alignment, especially at high noise levels.
Figure 3: Per-concept R$^2$-scores for the regression of VICE dimensions from SD2.1 representations, measured at different U-Net blocks for a noise level of 20%. Colors tend to be decodable at shallower layers, whereas most other concepts peak at the second up-sampling block.
Figure 4: Top: The decoded latents for different noise levels. Bottom: The images ${\bm{x}}$ reconstructed from the noisy latents via a single forward step by SD2.1.
Figure 5: Odd-one-out accuracy for zero-shot representations without text conditioning. Intermediate up-sampling layers are most aligned with human similarity judgments.
...and 5 more figures
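
The odd-one-out comparison described in the caption of Figure 1 can be illustrated with a short sketch. This is not the authors' implementation. It assumes that each image already has a fixed-size representation vector and that each human response is recorded as the index of the image judged to be the odd one out. The function names `odd_one_out_position` and `odd_one_out_accuracy` are hypothetical.

```python
import numpy as np


def odd_one_out_position(reps):
    """Return the position (0, 1 or 2) of the model's odd-one-out.

    reps: array of shape (3, d), one representation per image in the triplet.
    The most similar pair (by cosine similarity) is kept together, and the
    remaining image is taken as the odd one out.
    """
    normed = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    sim = normed @ normed.T
    # Entry i holds the similarity of the pair that excludes image i.
    pair_sims = np.array([sim[1, 2], sim[0, 2], sim[0, 1]])
    return int(np.argmax(pair_sims))


def odd_one_out_accuracy(representations, triplets, human_choices):
    """Fraction of triplets where the model's choice matches the human choice.

    representations: array of shape (n_images, d)
    triplets: array of shape (n_triplets, 3) holding image indices
    human_choices: array of shape (n_triplets,) holding the image index
        that the participant picked as the odd one out (assumed encoding)
    """
    triplets = np.asarray(triplets)
    hits = 0
    for triplet, human in zip(triplets, human_choices):
        position = odd_one_out_position(representations[triplet])
        hits += int(triplet[position] == human)
    return hits / len(triplets)


# Toy data: images 0 and 1 point in nearly the same direction, image 2 does not.
toy_reps = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
toy_triplets = np.array([[0, 1, 2], [2, 0, 1]])
toy_humans = np.array([2, 0])  # the second response disagrees with the model
toy_accuracy = odd_one_out_accuracy(toy_reps, toy_triplets, toy_humans)
```

In this toy example the model picks image 2 in both triplets, so it matches the human choice in only one of the two. To compare layers or noise levels, as in Figures 2 and 5, the same function would be called once per set of extracted representations.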
