Rethinking cluster-conditioned diffusion models for label-free image synthesis

Nikolas Adaloglou; Tim Kaiser; Felix Michels; Markus Kollmann

TL;DR

Cluster-conditioning is shown to achieve state-of-the-art performance, with an FID of 1.67 on CIFAR10 and 2.17 on CIFAR100, along with a strong increase in training sample efficiency. The paper also proposes a novel empirical method to estimate an upper bound for the optimal number of clusters.

Abstract

Diffusion-based image generation models can enhance image quality when conditioned on ground truth labels. Here, we conduct a comprehensive experimental study on image-level conditioning for diffusion models using cluster assignments. We investigate how individual clustering determinants, such as the number of clusters and the clustering method, impact image synthesis across three different datasets. Given the optimal number of clusters with respect to image synthesis, we show that cluster-conditioning can achieve state-of-the-art performance, with an FID of 1.67 for CIFAR10 and 2.17 for CIFAR100, along with a strong increase in training sample efficiency. We further propose a novel empirical method to estimate an upper bound for the optimal number of clusters. Unlike existing approaches, we find no significant association between clustering performance and the corresponding cluster-conditional FID scores. The code is available at https://github.com/HHU-MMBS/cedm-official-wavc2025.
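
For reference, FID is conventionally the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The abstract does not restate the definition, so the standard form below is an assumption about the metric behind the reported scores. The symbols $\mu_r, \Sigma_r$ (mean and covariance of real-image features) and $\mu_g, \Sigma_g$ (the same for generated images) are notation introduced here for illustration and do not come from the paper.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
$$

Lower values mean the two feature distributions are closer, and the value is zero when both means and covariances coincide.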

Paper Structure (26 sections, 3 equations, 15 figures, 8 tables)

Introduction
Related Work
Conditional generative models
Alternative conditioning of generative models
Method
Notations and prerequisites
Cluster-conditional EDM (C-EDM)
Estimating the upper cluster bound
Experimental evaluation
Datasets, models, and metrics
State-of-the-art comparison for image synthesis
Cluster utilization ratio and discovered upper bounds
Investigating the connection between clustering and cluster-conditional image synthesis
Discussion
Conclusion
...and 11 more sections

Figures (15)

Figure 1: An ideal image-level conditioning should group images based on shared patterns, shown in the same row, which do not always align with human labels, indicated above each image (CIFAR100 samples).
Figure 2: FID (y-axis) versus seen samples during training in millions (x-axis). TEMI and k-means clusters are computed using the representations of DINO ViT-B. We used $C_V=100,200,400$ for CIFAR10, CIFAR100, and FFHQ-64, respectively. The training sample efficiency compared to the unconditional baseline is indicated by the arrow. Best viewed in color.
Figure 3: FID (left y-axis) and TEMI cluster utilization ratio $r_C$ (right y-axis) across different numbers of clusters $C$ (x-axis) using C-EDM, evaluated at $M_{img}=100$. The green area indicates the discovered cluster range $[2,C_{max})$ for $r_C\leq \alpha=0.96$.
Figure 4: FID (y-axis) across different numbers of clusters $C$ (x-axis) using C-EDM with TEMI with different feature extractors. The ANMI is shown in parentheses for $C_V$=100.
Figure 5: Top 1-NN cosine similarity AUROC (left y-axis) and Fréchet distance between the C-EDM and unconditional samples (uFID) for different numbers of clusters $C$ (x-axis). For the computation of AUROC, we use the official test splits.
...and 10 more figures
