Rethinking cluster-conditioned diffusion models for label-free image synthesis

Nikolas Adaloglou, Tim Kaiser, Felix Michels, Markus Kollmann

TL;DR

The paper shows that cluster-conditioning can achieve state-of-the-art performance, with an FID of 1.67 for CIFAR10 and 2.17 for CIFAR100, along with a strong increase in training sample efficiency. It also proposes a novel empirical method to estimate an upper bound for the optimal number of clusters.

Abstract

Diffusion-based image generation models can enhance image quality when conditioned on ground truth labels. Here, we conduct a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We investigate how individual clustering determinants, such as the number of clusters and the clustering method, impact image synthesis across three different datasets. Given the optimal number of clusters with respect to image synthesis, we show that cluster-conditioning can achieve state-of-the-art performance, with an FID of 1.67 for CIFAR10 and 2.17 for CIFAR100, along with a strong increase in training sample efficiency. We further propose a novel empirical method to estimate an upper bound for the optimal number of clusters. Unlike existing approaches, we find no significant association between clustering performance and the corresponding cluster-conditional FID scores. The code is available at https://github.com/HHU-MMBS/cedm-official-wavc2025.

Paper Structure (26 sections, 3 equations, 15 figures, 8 tables)

This paper contains 26 sections, 3 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: An ideal image-level conditioning should group images based on shared patterns, shown in the same row, which do not always align with human labels, indicated above each image (CIFAR100 samples).
  • Figure 2: FID (y-axis) versus seen samples during training in millions (x-axis). TEMI and k-means clusters are computed using the representations of DINO ViT-B. We used $C_V=100,200,400$ for CIFAR10, CIFAR100 and FFHQ-64 respectively. The training sample efficiency compared to the unconditional baseline is indicated by the arrow. Best viewed in color.
  • Figure 3: FID (left y-axis) and TEMI cluster utilization ratio $r_C$ (right y-axis) across different numbers of clusters $C$ (x-axis) using C-EDM, evaluated at $M_{img}=100$. The green area indicates the discovered cluster range $[2,C_{max})$ for $r_C\leq \alpha=0.96$.
  • Figure 4: FID (y-axis) across different numbers of clusters $C$ (x-axis) using C-EDM with TEMI with different feature extractors. The ANMI is shown in parentheses for $C_V$=100.
  • Figure 5: Top 1-NN cosine similarity AUROC (left y-axis) and Fréchet distance between the C-EDM and unconditional samples (uFID) for different cluster sizes $C$ (x-axis). For the computation of AUROC, we use the official test splits.
  • ...and 10 more figures
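
The Figure 3 caption refers to a cluster utilization ratio $r_C$ that is compared against a threshold $\alpha=0.96$. The excerpt here does not define $r_C$, so the sketch below assumes one plausible reading: the fraction of the $C$ available clusters that receive at least one sample. The function name and the toy assignments are hypothetical and are not taken from the authors' code; how the threshold is turned into $C_{max}$ is left to the paper.

```python
from typing import Sequence


def cluster_utilization_ratio(assignments: Sequence[int], num_clusters: int) -> float:
    """Fraction of the num_clusters available clusters that are actually used.

    ASSUMPTION: this definition of r_C is an illustrative guess; the
    source text only names the quantity without defining it.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    # Collect the distinct cluster ids that appear, ignoring out-of-range ids.
    used = {a for a in assignments if 0 <= a < num_clusters}
    return len(used) / num_clusters


# Toy example: 4 clusters are available, but cluster 1 never receives a sample.
toy_assignments = [0, 0, 2, 3, 3, 2]
r_c = cluster_utilization_ratio(toy_assignments, num_clusters=4)
alpha = 0.96  # threshold value quoted in the Figure 3 caption
print(f"r_C = {r_c:.2f}, r_C <= alpha: {r_c <= alpha}")
```

In practice the assignments would come from a clustering method such as TEMI or k-means applied to pretrained features, and the ratio would be computed once per candidate $C$.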