Latent Diffusion U-Net Representations Contain Positional Embeddings and Anomalies

Jonas Loos, Lorenz Linhardt

TL;DR

This work analyzes latent diffusion U-Net representations in Stable Diffusion to assess their suitability as robust features for downstream tasks. By applying representational similarity analyses and token-norm assessments, it uncovers three phenomena: a linearly extractable positional embedding, corner tokens with abnormally high similarity, and high-norm anomalies in up-sampling blocks. The findings hold across SD-1.5, SD-2.1, and SD-Turbo on a subset of ImageNet, highlighting potential pitfalls for tasks requiring spatial locality or reliable feature norms. The work motivates caution and further study of diffusion-model representations before deploying them for robust downstream applications.

Abstract

Diffusion models have demonstrated remarkable capabilities in synthesizing realistic images, spurring interest in using their representations for various downstream tasks. To better understand the robustness of these representations, we analyze popular Stable Diffusion models using representational similarity and norms. Our findings reveal three phenomena: (1) the presence of a learned positional embedding in intermediate representations, (2) high-similarity corner artifacts, and (3) anomalous high-norm artifacts. These findings underscore the need to further investigate the properties of diffusion model representations before considering them for downstream tasks that require robust features. Project page: https://jonasloos.github.io/sd-representation-anomalies
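The two measurements named in the abstract, representational similarity and norms, can be illustrated with a minimal sketch. It assumes a single representation is available as an array of shape (H, W, C), one C-dimensional token per spatial position; the function name `similarity_and_norm_maps` and the random placeholder features are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def similarity_and_norm_maps(features, ref_yx):
    """Return per-position cosine similarity to a reference token, and per-position norms.

    features: (H, W, C) array of spatial tokens (assumed layout).
    ref_yx:   (row, col) of the reference token.
    """
    h, w, c = features.shape
    tokens = features.reshape(h * w, c)
    norms = np.linalg.norm(tokens, axis=1)                 # Euclidean norm per token
    unit = tokens / np.maximum(norms, 1e-12)[:, None]      # guard against zero norm
    ref = unit[ref_yx[0] * w + ref_yx[1]]                  # unit-length reference token
    cos = unit @ ref                                       # cosine similarity to reference
    return cos.reshape(h, w), norms.reshape(h, w)

# Placeholder data standing in for a real U-Net block output.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, 64))
cos_map, norm_map = similarity_and_norm_maps(feats, (0, 0))
```

On real representations, a similarity map that depends mainly on position rather than image content, or a norm map with a few isolated large values, would correspond to the kinds of phenomena the abstract describes.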

Paper Structure

This paper contains 22 sections, 1 equation, 5 figures.

Figures (5)

  • Figure 1: Cosine similarity and Euclidean norm across spatial positions of representations. Each column shows an example of one of the three observations, or of meaningful similarities. Similarities in each column are relative to the token highlighted by a marker of the matching color (×, ×, ×, ×) in one of the images. Representations are extracted from SD-1.5 at the blocks indicated at the top.
  • Figure 2: Quantitative results for position estimation, border/corner artifacts, and high-norm anomalies for SD-1.5. Top row: Linear probe accuracy for position estimation. Brighter shades indicate reduced resolution. Middle row: Relative similarity of tokens lying at a border/corner of the cropped images w.r.t. their similarity before cropping (log-2 scale). Bottom row: Relative average norm of anomalous tokens w.r.t. the mean norm of all tokens of the same representation (log-2 scale).
  • Figure 3: Cosine similarity and Euclidean norm for representations of SD-2.1 and SD-Turbo. The similarities are relative to the representation token at the image and location of the marker in the respective image pair. Top left: Positional embedding for SD-2.1 (left), and SD-Turbo (right). Top right: Corner/border anomalies for SD-2.1 (left), and SD-Turbo (right). Bottom: High-norm anomalies for SD-2.1 (left), and SD-Turbo (right).
  • Figure 4: Quantitative results for SD-2.1. Top row: Linear probe accuracy for position estimation. Brighter shades indicate reduced resolution. Middle row: Relative similarity of tokens lying at a border/corner of the cropped images w.r.t. their similarity before cropping (log-2 scale). Bottom row: Relative average norm of anomalous tokens w.r.t. the mean norm of all tokens of the same representation (log-2 scale).
  • Figure 5: Quantitative results for SD-Turbo. Top row: Linear probe accuracy for position estimation. Brighter shades indicate reduced resolution. Middle row: Relative similarity of tokens lying at a border/corner of the cropped images w.r.t. their similarity before cropping (log-2 scale). Bottom row: Relative average norm of anomalous tokens w.r.t. the mean norm of all tokens of the same representation (log-2 scale).
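
The "linear probe accuracy for position estimation" reported in the top rows of Figures 2, 4, and 5 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the probe here is a ridge-regularized least-squares classifier over flattened position indices, and the tokens are synthetic (a fixed per-position vector plus Gaussian noise). The helper names, the synthetic data, and the choice of probe are hypothetical; the paper's exact probe setup is not given in this section.

```python
import numpy as np

def fit_linear_probe(x, y, num_classes, ridge=1e-3):
    """Fit a ridge-regularized least-squares classifier on one-hot targets."""
    xb = np.hstack([x, np.ones((len(x), 1))])        # append a bias column
    targets = np.eye(num_classes)[y]                 # one-hot position labels
    gram = xb.T @ xb + ridge * np.eye(xb.shape[1])
    return np.linalg.solve(gram, xb.T @ targets)     # shape (C + 1, num_classes)

def probe_accuracy(weights, x, y):
    """Fraction of tokens whose predicted position index matches the label."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    return float(np.mean(np.argmax(xb @ weights, axis=1) == y))

def make_tokens(rng, n_images, pos_embed, noise=0.1):
    """Synthetic tokens (assumption): per-position embedding plus Gaussian noise."""
    n_pos, c = pos_embed.shape
    x = pos_embed[None] + noise * rng.normal(size=(n_images, n_pos, c))
    y = np.tile(np.arange(n_pos), n_images)          # label = flattened position index
    return x.reshape(-1, c), y

rng = np.random.default_rng(0)
n_pos, channels = 16, 32                             # e.g. a 4x4 grid of tokens
pos_embed = rng.normal(size=(n_pos, channels))

# Tokens that carry a positional signal.
x_train, y_train = make_tokens(rng, 20, pos_embed)
x_test, y_test = make_tokens(rng, 10, pos_embed)
w = fit_linear_probe(x_train, y_train, n_pos)
acc_with_pos = probe_accuracy(w, x_test, y_test)

# Control: the same pipeline on tokens without any positional signal.
zero_embed = np.zeros_like(pos_embed)
x_train0, y_train0 = make_tokens(rng, 20, zero_embed)
x_test0, y_test0 = make_tokens(rng, 10, zero_embed)
w0 = fit_linear_probe(x_train0, y_train0, n_pos)
acc_without_pos = probe_accuracy(w0, x_test0, y_test0)

print(acc_with_pos, acc_without_pos)
```

The control run is the point of the design: a probe accuracy well above the chance level of 1/16 is only meaningful as evidence of a positional embedding when the same probe stays near chance on features that contain no positional information.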