TreeDiffusion: Hierarchical Generative Clustering for Conditional Diffusion

Jorge da Silva Gonçalves, Laura Manduchi, Moritz Vandenhirtz, Julia E. Vogt

TL;DR

TreeDiffusion tackles the gap between clustering and high-fidelity image generation by conditioning diffusion on hierarchical latent representations learned by a TreeVAE. It introduces a two-stage pipeline where TreeVAE performs hierarchical clustering and a DDIM-based diffusion model, guided by a path encoder, generates cluster-specific images. Empirically, the approach improves generation quality (FID) across multiple datasets and preserves clear cluster structure in the output, outperforming a naive TreeVAE+Diffusion baseline. The method also enables interpretable visualizations of the learned latent hierarchy, highlighting the benefits of hierarchical conditioning for generative clustering.
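
The DDIM-based generation step mentioned above can be illustrated with the standard deterministic DDIM update (eta = 0). This is a minimal sketch, not the authors' implementation: the function name `ddim_step`, the flat-list image representation, and the argument names are assumptions for illustration. In a conditional setup such as this one, `eps_pred` would come from a noise-prediction network that also receives the conditioning signal.

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of values.

    x_t            : current noisy sample
    eps_pred       : predicted noise (assumed to come from a conditioned network)
    alpha_bar_t    : cumulative noise-schedule product at the current step
    alpha_bar_prev : cumulative noise-schedule product at the previous (less noisy) step
    """
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [
        (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        for x, e in zip(x_t, eps_pred)
    ]
    # Re-noise the estimate to the previous noise level along the same direction.
    return [
        math.sqrt(alpha_bar_prev) * x0 + math.sqrt(1.0 - alpha_bar_prev) * e
        for x0, e in zip(x0_pred, eps_pred)
    ]
```

If the predicted noise equals the true noise and `alpha_bar_prev` is 1.0, the update returns the clean sample exactly, which is a convenient sanity check for the formula.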

Abstract

Generative modeling and clustering are conventionally distinct tasks in machine learning. Variational Autoencoders (VAEs) have been widely explored for their ability to integrate both, providing a framework for generative clustering. However, while VAEs can learn meaningful cluster representations in latent space, they often struggle to generate high-quality samples. This paper addresses this problem by introducing TreeDiffusion, a deep generative model that conditions diffusion models on learned latent hierarchical cluster representations from a VAE to obtain high-quality, cluster-specific generations. Our approach consists of two steps: first, a VAE-based clustering model learns a hierarchical latent representation of the data. Second, a cluster-aware diffusion model generates realistic images conditioned on the learned hierarchical structure. We systematically compare the generative capabilities of our approach with those of alternative conditioning strategies. Empirically, we demonstrate that conditioning diffusion models on hierarchical cluster representations improves the generative performance on real-world datasets compared to other approaches. Moreover, a key strength of our method lies in its ability to generate images that are both representative and specific to each cluster, enabling more detailed visualization of the learned latent structure. Our approach addresses the generative limitations of VAE-based clustering approaches by leveraging their learned structure, thereby advancing the field of generative clustering.
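
The two-step idea described above (sample a root-to-leaf path in the learned tree, then build a conditioning signal from it) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's model: the `Node` class, the routing probabilities, the fixed embeddings, and the concatenation-plus-one-hot conditioning are all hypothetical stand-ins. In the actual method the embeddings are latent variables produced by the VAE and the conditioning signal comes from a learned encoder network.

```python
import random

class Node:
    """Hypothetical tree node holding an embedding and a routing probability."""
    def __init__(self, name, embedding, p_left=None, left=None, right=None):
        self.name = name
        self.embedding = embedding  # stand-in for a latent embedding at this node
        self.p_left = p_left        # probability of routing to the left child
        self.left = left
        self.right = right

def sample_path(root, rng):
    """Walk from the root to a leaf, choosing a child by routing probability."""
    path = [root]
    node = root
    while node.left is not None:
        node = node.left if rng.random() < node.p_left else node.right
        path.append(node)
    return path

def conditioning_signal(path, leaf_names):
    """Concatenate embeddings along the path with a one-hot leaf indicator.

    Assumes a balanced tree so every path yields a vector of the same length.
    """
    features = [v for node in path for v in node.embedding]
    one_hot = [1.0 if path[-1].name == name else 0.0 for name in leaf_names]
    return features + one_hot

def build_toy_tree():
    """Build a balanced depth-2 binary tree with four leaves (toy values)."""
    leaves = [Node(f"leaf{i}", [float(i), 1.0]) for i in range(4)]
    a = Node("A", [0.5, -0.5], p_left=0.7, left=leaves[0], right=leaves[1])
    b = Node("B", [-0.5, 0.5], p_left=0.4, left=leaves[2], right=leaves[3])
    root = Node("root", [0.0, 0.0], p_left=0.5, left=a, right=b)
    return root, [leaf.name for leaf in leaves]

root, leaf_names = build_toy_tree()
```

A call such as `conditioning_signal(sample_path(root, random.Random(0)), leaf_names)` yields a fixed-length vector that a diffusion model's denoiser could take as an extra input, so that each leaf (cluster) steers generation differently.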

Paper Structure

This paper contains 27 sections, 24 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Schematic overview of the TreeDiffusion framework: TreeVAE encodes data into hierarchical latent variables, where a path is sampled from the root to a leaf node. An encoder network creates a conditioning signal using the sampled hierarchical path embeddings. The diffusion model leverages this information to condition its reverse process and generate a cluster-specific image.
  • Figure 2: Ten different CIFAR-10 reconstructions generated by the TreeVAE model, each obtained by sampling a single path in the tree. Corresponding reconstructions from TreeVAE + Diffusion, which begins denoising with the TreeVAE reconstructions, are shown alongside those from TreeDiffusion, which conditions on the same selected path and embeddings but starts denoising from noise.
  • Figure 3: Ten different samples generated by the TreeVAE model, each generated by sampling one path in the tree, and corresponding samples from the TreeDiffusion model, conditioned on the same selected path and embeddings from TreeVAE.
  • Figure 4: Image generations from every leaf of the TreeVAE and TreeDiffusion model, both trained on the CUBICC dataset. Each row shows the generated images from all leaves of the respective model, starting with the same root sample.
  • Figure 5: TreeDiffusion model trained on FashionMNIST. For each cluster, random newly generated images are displayed. Below each set of images, a normalized histogram (ranging from 0 to 1) shows the distribution of predicted classes from an independent, pre-trained classifier on FashionMNIST for all newly generated images in each leaf with a significant probability of reaching that leaf.
  • ...and 7 more figures