DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

Ruohong Yang, Peng Hu, Xi Peng, Xiting Liu, Yunfan Li

TL;DR

DiFiC tackles fine-grained clustering by using a pre-trained text-to-image diffusion model to distill image semantics into textual conditions, rather than extracting features directly from images. The method combines semantic distillation, object-focused diffusion via attention masks, and neighborhood-guided clustering to produce compact, discriminative semantic representations $S^*$ that drive clustering. Empirical results on four fine-grained datasets show state-of-the-art ACC and NMI, with ablations validating the three-module design. This work highlights a novel use of diffusion models for discriminative tasks and suggests that diffusion-based approaches can capture finer semantic distinctions without heavy reliance on data augmentation.

Abstract

Fine-grained clustering is a practical yet challenging task, whose essence lies in capturing the subtle differences between instances of different classes. Such subtle differences can be easily disrupted by data augmentation or be overwhelmed by redundant information in data, leading to significant performance degradation for existing clustering methods. In this work, we introduce DiFiC, a fine-grained clustering method building upon the conditional diffusion model. Distinct from existing works that focus on extracting discriminative features from images, DiFiC resorts to deducing the textual conditions used for image generation. To distill more precise and clustering-favorable object semantics, DiFiC further regularizes the diffusion target and guides the distillation process utilizing neighborhood similarity. Extensive experiments demonstrate that DiFiC outperforms both state-of-the-art discriminative and generative clustering methods on four fine-grained image clustering benchmarks. We hope the success of DiFiC will inspire future research to unlock the potential of diffusion models in tasks beyond generation. The code will be released.

Paper Structure

This paper contains 20 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Our key idea. Existing clustering methods struggle to capture the subtle signals to distinguish fine-grained data. Specifically, (a) discriminative clustering methods heavily rely on data augmentation, which may disrupt subtle semantic differences, leading to inferior between-cluster separation. (b) Generative clustering methods infer latent features for reconstructing all pixels, in which redundant background information may overwhelm fine-grained semantics, resulting in suboptimal within-cluster compactness. (c) Instead of directly learning image features, our method resorts to deducing the textual conditions used for image generation, overcoming the semantic absence or redundancy problem suffered by previous works.
  • Figure 2: Overview of the proposed DiFiC, a fine-grained image clustering method built upon a pre-trained text-to-image diffusion model, which consists of three main modules: (a) Semantic Distillation Module: for each noisy image $x_t$, DiFiC first extracts its semantic proxy word $S^*$ using a semantic extractor, which is then concatenated with a prompt to form the textual condition $c$ for image generation. By requiring the diffusion model to restore the original image, DiFiC distills the image semantics into the proxy word. (b) Object Concentration Module: instead of restoring the full image, DiFiC computes the object mask based on attention maps, which is applied to both the original and generated images when calculating the diffusion loss. As a result, the distilled semantics center on the main object. (c) Cluster Guidance Module: given proxy word embeddings, DiFiC introduces a clustering head to group images based on neighborhood similarity. The clustering loss simultaneously optimizes the semantic extractor and clustering head, guiding the distillation to produce more compact semantics. In the figure, the bold brown arrows denote the data flow to achieve clustering.
  • Figure 3: Clustering performance of DiFiC on four datasets across the training process, with ACC divided by 100 for simplicity. In the first 100 warm-up epochs, the performance refers to applying $k$-means on proxy words $S^*$. At epoch 100, $\mathcal{L}_{CG}$ and the clustering head $g(\cdot)$ are introduced. The sudden performance drop is due to the random initialization of $g(\cdot)$.
  • Figure 4: Visualization of features learned by SeCu, C3-GAN, Stable Diffusion, and our DiFiC on CUB dataset, with the corresponding clustering NMI annotated at the top.
  • Figure 5: The cross-attention maps and the masked images of different $\tilde{t}$.
  • ...and 1 more figure
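
The Object Concentration Module described in the Figure 2 caption restricts the diffusion loss to the main object by masking both the original and the generated image. The snippet below is a minimal sketch of that idea, not the authors' implementation. It assumes the mask is obtained by min-max normalizing a single attention map and thresholding it at 0.5, and that the loss is a mean squared error averaged over masked pixels only. The function names `object_mask` and `masked_mse`, the threshold value, and the normalization rule are all hypothetical choices made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def object_mask(attention: Matrix, threshold: float = 0.5) -> Matrix:
    """Binarize an attention map after min-max normalization (assumed rule)."""
    flat = [a for row in attention for a in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[1.0 if (a - lo) / scale >= threshold else 0.0 for a in row]
            for row in attention]


def masked_mse(original: Matrix, generated: Matrix, mask: Matrix) -> float:
    """Mean squared error restricted to pixels where the mask is 1."""
    total, count = 0.0, 0.0
    for o_row, g_row, m_row in zip(original, generated, mask):
        for o, g, m in zip(o_row, g_row, m_row):
            total += m * (o - g) ** 2  # background pixels contribute nothing
            count += m
    return total / count if count else 0.0


# Toy 2x2 example: the right column is "object", the left is "background".
attention = [[0.1, 0.9], [0.2, 0.8]]
original = [[1.0, 1.0], [1.0, 1.0]]
generated = [[0.0, 1.0], [5.0, 3.0]]

mask = object_mask(attention)
loss = masked_mse(original, generated, mask)
```

In this toy case the large background error at the bottom-left pixel does not affect the loss, which illustrates why masking would push the distilled semantics toward the object rather than its surroundings.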