Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning

Jiawei Yao, Qi Qian, Juhua Hu

TL;DR

This work introduces Multi-Sub, a novel end-to-end multiple clustering approach built on a multi-modal subspace proxy learning framework. It consistently outperforms existing baselines across a broad set of datasets in visual multiple clustering tasks.

Abstract

Multiple clustering aims to discover various latent structures of data from different aspects. Deep multiple clustering methods have achieved remarkable performance by exploiting complex patterns and relationships in data. However, existing works struggle to adapt flexibly to diverse user-specific needs in data grouping, which may require manual understanding of each clustering. To address these limitations, we introduce Multi-Sub, a novel end-to-end multiple clustering approach that incorporates a multi-modal subspace proxy learning framework. Utilizing the synergistic capabilities of CLIP and GPT-4, Multi-Sub aligns textual prompts expressing user preferences with their corresponding visual representations. This is achieved by automatically generating proxy words from large language models that act as subspace bases, thus allowing data to be represented in terms specific to the user's interests. Our method consistently outperforms existing baselines across a broad set of datasets in visual multiple clustering tasks. Our code is available at https://github.com/Alexander-Yao/Multi-Sub.

Paper Structure

This paper contains 25 sections, 8 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The workflow of Multi-Sub that obtains a desired clustering based on the subspace spanned by reference words obtained from GPT-4 using users' high-level interest.
  • Figure 2: Multi-Sub framework. Phase I (Proxy Learning and Alignment) processes each image $x_i$ with user-defined textual prompts through a partially learnable image encoder (with a learnable projection layer) and a frozen text encoder. The latent factor $\mathbf{p}_i$ is used to compute weights $\{a_{i,k}\}_{k=1}^K$ based on its similarity to the reference word embeddings $\{\mathbf{z}_k\}_{k=1}^K$; the weighted reference embeddings are then aggregated to form the proxy word embedding $\mathbf{w}_i$. This proxy word embedding, combined with the image representation $\mathbf{x}_i$, establishes the Aligned Feature Subspace for better alignment between the text and image under the user's interest. In Phase II (Clustering), the proxy word embeddings $\{\mathbf{w}_i\}$ learned in Phase I are used to form pseudo-labels, and the projection layer of the image encoder is further refined using the clustering loss. In Phase I, both the latent factor $\mathbf{p}$ and the projection layer are trained for 100 epochs, after which the projection layer is trained for a further 10 epochs using the clustering loss in Phase II. This alternating process repeats until convergence.
  • Figure 3: Visualization of feature embeddings and related labels on the Fruit dataset. In the color visualization, red, green, and yellow points denote the colors red, green, and yellow, respectively. In the species visualization, red, yellow, and purple points denote the species apple, banana, and grapes, respectively.
  • Figure 4: Sensitivity analysis of the balancing factor $\lambda$ on the CIFAR-10 dataset.
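
The proxy-word construction described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. The caption states only that the weights $a_{i,k}$ are based on the similarity between the latent factor $\mathbf{p}_i$ and the reference word embeddings; the cosine similarity, the softmax normalization, the `temperature` parameter, and the function name `proxy_word_embedding` are assumptions made here for illustration and are not taken from the paper or its released code.

```python
import numpy as np

def proxy_word_embedding(p_i, Z, temperature=1.0):
    """Hypothetical sketch of forming a proxy word embedding w_i.

    p_i: latent factor for one image, shape (d,)
    Z:   reference word embeddings, shape (K, d)
    Returns (a, w): weights of shape (K,) and proxy embedding of shape (d,).
    """
    # Assumption: similarity is cosine similarity between p_i and each z_k.
    p = p_i / np.linalg.norm(p_i)
    Z_unit = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    sims = Z_unit @ p

    # Assumption: weights come from a temperature-scaled softmax.
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    a = np.exp(logits)
    a = a / a.sum()

    # Aggregate the reference embeddings into the proxy word embedding.
    w = a @ Z
    return a, w

# Toy example: three one-hot "reference words" in a 3-dimensional space.
Z_demo = np.eye(3)
a_demo, w_demo = proxy_word_embedding(np.array([1.0, 0.0, 0.0]), Z_demo, temperature=0.1)
```

Under these assumptions the proxy embedding is a convex combination of the reference embeddings, so it stays inside the subspace they span; a lower `temperature` concentrates the weight on the most similar reference word.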