
TDEC: Deep Embedded Image Clustering with Transformer and Distribution Information

Ruilin Zhang, Haiyang Zheng, Hongpeng Wang

Abstract

Image clustering is a crucial but challenging task in multimedia machine learning. Recently, the combination of clustering with deep learning has achieved promising performance over conventional methods on high-dimensional image data. Unfortunately, existing deep clustering (DC) methods often ignore the importance of information fusion with a global perception field among different image regions when clustering images, especially complex ones. Additionally, the learned features are usually clustering-unfriendly in terms of dimensionality and rely only on simple distance information for clustering. In this regard, we propose a deep embedded image clustering method, TDEC, which, to our knowledge, is the first to jointly consider feature representation, dimensional preference, and robust assignment for image clustering. Specifically, we introduce the Transformer to form a novel module, the T-Encoder, to learn discriminative features with global dependency, while using the Dim-Reduction block to build a low-dimensional space favoring clustering. Moreover, the distribution information of embedded features is considered in the clustering process to provide reliable supervised signals for joint training. Our method is robust and allows for more flexibility in data size, number of clusters, and context complexity. More importantly, the clustering performance of TDEC is much higher than that of recent competitors. Extensive experiments against state-of-the-art approaches on complex datasets show the superiority of TDEC.


Paper Structure

This paper contains 15 sections, 15 equations, 9 figures, 6 tables, 1 algorithm.

Figures (9)

  • Figure 1: The framework of our proposed TDEC. It consists of T-Encoder, T-Decoder, Dim-Reduction block (DR), and Clustering Head (CH). Note that $L_{stru}$ and $L_{clu}$ represent the structure loss (the linear combination of reconstruction loss $L_{rec}$ and dimension reduction loss $L_{dim}$) and clustering loss.
  • Figure 2: An individual Transformer block.
  • Figure 3: The architecture of the Dim-Reduction block. Here the features $Z^i_w$ of size 1×10 learned by the T-Encoder for raw image $x_i$ are used as input to project a more cluster-friendly feature $Z^i_v$ of size 1×2.
  • Figure 4: The convergence process of our model on MNIST
  • Figure 5: Clustering samples on the YTF dataset. Each row contains the top 20 scoring images from one cluster, based on the distance from the cluster center.
  • ...and 4 more figures
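The abstract does not spell out the form of the clustering loss $L_{clu}$, but "deep embedded clustering" methods in this family typically use a Student's t-kernel soft assignment between embedded features (here, the 1×2 outputs of the Dim-Reduction block) and cluster centers, sharpened into a target distribution that serves as the self-supervised signal. The sketch below illustrates that standard DEC-style formulation as an assumption; the function names and the choice of NumPy are illustrative, not TDEC's actual implementation.

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    # Student's t-kernel soft assignment q_ij between embedded points
    # z (n, d) and cluster centers mu (k, d); rows sum to 1.
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    # Sharpened target P: squares q and normalizes per cluster,
    # emphasizing high-confidence assignments.
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    # KL(P || Q): the supervised signal driving joint training.
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))   # six 1x2 embeddings (Dim-Reduction output size)
mu = rng.normal(size=(3, 2))  # three cluster centers
q = soft_assign(z, mu)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

In joint training, minimizing this KL term while also minimizing the structure loss ($L_{rec}$ plus $L_{dim}$, per Figure 1) pulls embeddings toward confident cluster assignments without discarding reconstruction fidelity.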