Diffusion Mental Averages

Phonphrm Thawatdamrongkit, Sukit Seripanitkarn, Supasorn Suwajanakorn

Abstract

Can a diffusion model produce its own "mental average" of a concept, one that is as sharp and realistic as a typical sample? We introduce Diffusion Mental Averages (DMA), a model-centric answer to this question. While prior methods aim to average image collections, they produce blurry results when applied to diffusion samples from the same prompt: these data-centric techniques operate outside the model, ignoring the generative process. In contrast, DMA averages within the diffusion model's semantic space, as discovered by recent studies. Since this space evolves across timesteps and lacks a direct decoder, we cast averaging as trajectory alignment: we optimize multiple noise latents so their denoising trajectories progressively converge toward shared coarse-to-fine semantics, yielding a single sharp prototype. We extend our approach to multimodal concepts (e.g., dogs with many breeds) by clustering samples in semantically rich spaces such as CLIP and applying Textual Inversion or LoRA to bridge CLIP clusters into diffusion space. To our knowledge, this is the first approach that delivers consistent, realistic averages, even for abstract concepts, serving as a concrete visual summary and a lens into model biases and concept representation.
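
To make the trajectory-alignment procedure concrete, here is a minimal sketch of the core loop; it illustrates the idea described above and is not the authors' implementation. The callables `unet_h_features` and `denoise_step` are hypothetical stand-ins for the diffusion U-Net's bottleneck (h-space) activations and a single denoising update.

```python
# Minimal sketch of DMA-style trajectory alignment (illustrative, not the
# paper's code). `unet_h_features` and `denoise_step` are hypothetical
# stand-ins for h-space activations and one denoising (e.g., DDIM) step.
import torch

def dma_prototype(latents, timesteps, unet_h_features, denoise_step,
                  n_opt_steps=10, lr=0.05):
    """Jointly align N same-prompt denoising trajectories in h-space.

    latents: (N, C, H, W) noise latents sampled for the same prompt.
    """
    for t in timesteps:  # coarse-to-fine denoising schedule
        latents = latents.detach().requires_grad_(True)
        opt = torch.optim.Adam([latents], lr=lr)
        for _ in range(n_opt_steps):
            h = unet_h_features(latents, t)                # (N, D) activations
            target = h.mean(dim=0, keepdim=True).detach()  # shared semantic target
            loss = ((h - target) ** 2).mean()              # pull trajectories together
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            latents = denoise_step(latents, t)  # advance all trajectories one step
    return latents  # aligned latents decode to near-identical prototypes
```

Because the semantic target is re-estimated at every timestep, coarse semantics are aligned early in the schedule and fine details late, which is what lets the shared prototype stay sharp rather than blurred.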

Figures (35)

  • Figure 1: Our method unveils the Mental Averages encoded by pre-trained diffusion models across diverse concepts (left) and generalizes across different model variants (right), offering new tools for analyzing model biases and interpreting learned concept representations.
  • Figure 2: Overview. Multiple noise latents are jointly optimized so that their denoising trajectories converge toward shared semantics. At each timestep $t$, their $h$-space activations are averaged to form a semantic target, and each latent is optimized to match it before denoising to the next step. Repeating this process across timesteps aligns coarse-to-fine semantics, yielding a single "mental average" of the concept.
  • Figure 3: Qualitative comparison across methods: GANgealing [peebles2022gan], Avg. VAE, D$^4$M [su2024d], MGD$^3$ [chan2025mgd], and DMA (Ours). Rows show concepts: astronaut, dog, bicycle, freedom. GANgealing cannot process abstract concepts, so its freedom cell is intentionally left blank.
  • Figure 4: Quality Trade-off. DMA achieves a superior balance of representativeness, consistency, and perceptual quality.
  • Figure 5: DMA prototypes of unsupervised modes. Top row: overall average. Bottom rows: averages of discovered modes.
  • ...and 30 more figures
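
The multimodal extension mentioned in the abstract (and illustrated by the discovered modes in Figure 5) first clusters same-prompt samples in a semantically rich embedding space such as CLIP. The sketch below shows one plausible instantiation using off-the-shelf CLIP features and k-means; the checkpoint name and the choice of k-means are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: discover modes of a concept by clustering same-prompt
# samples in CLIP space. Each cluster would then be bridged back into the
# diffusion model (via Textual Inversion or LoRA) and averaged with DMA.
import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

def discover_modes(image_paths, n_modes=4):
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)   # (N, 512) image embeddings
    emb = emb / emb.norm(dim=-1, keepdim=True)     # normalize for cosine geometry
    labels = KMeans(n_clusters=n_modes, n_init=10).fit_predict(emb.numpy())
    return labels  # cluster assignment per sample, e.g., one dog breed per mode
```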