Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering

Jiawei Yao; Qi Qian; Juhua Hu

TL;DR

This paper tackles the problem of selecting among multiple clustering outcomes by enabling personalization through user-provided keywords. It introduces Multi-MaP, a framework that leverages frozen CLIP encoders and GPT-4-generated reference words to learn a text proxy that aligns with a user’s interest, and it enforces concept-level and reference-word constraints to guide the learning process. A key theoretical insight shows that using the nearest reference token bounds the approximation error when mapping continuous proxy words to discrete CLIP tokens. Empirically, Multi-MaP achieves state-of-the-art performance across diverse visual multi-clustering tasks, outperforms zero-shot CLIP baselines, and demonstrates the benefits of combining reference words, user concepts, and contrastive learning for personalized clustering.
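The nearest-reference-token idea can be illustrated with a minimal sketch. It assumes Euclidean distance as the notion of "nearest", and the 3-dimensional embeddings, the token names, and the helper `nearest_token` are all made up for illustration; they are not taken from the paper or from CLIP.

```python
import math


def nearest_token(proxy, token_table):
    """Return (name, distance) of the token embedding closest to `proxy`.

    Euclidean distance is an assumption made for this sketch.
    """
    best_name, best_dist = None, math.inf
    for name, vec in token_table.items():
        dist = math.dist(proxy, vec)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Toy embeddings (hypothetical values, not real CLIP token embeddings).
reference_tokens = {
    "red": [1.0, 0.0, 0.0],
    "green": [0.0, 1.0, 0.0],
    "yellow": [0.7, 0.7, 0.0],
}

# A continuous proxy word that does not coincide with any discrete token.
proxy_word = [0.9, 0.2, 0.1]

name, dist = nearest_token(proxy_word, reference_tokens)
print(name, round(dist, 3))
```

The smaller the returned distance, the less the continuous proxy deviates from a token the text encoder has actually seen, which is the quantity a Lipschitz-style bound would be expected to depend on.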

Abstract

Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However, a major challenge arises as users often do not need all the clusterings that algorithms generate, and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was challenging, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, reference word constraint and concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at https://github.com/Alexander-Yao/Multi-MaP.

Paper Structure (23 sections, 2 theorems, 11 equations, 5 figures, 7 tables)

Introduction
Related Work
Multiple Clustering
Multi-modal Model
The Proposed Method
Multi-modal Pre-training
Multi-modal Proxy Learning
Remark
Concept-level Constraint
Constrained Optimization with Reference Word
Contrastive Concepts
Experiments
Experiment Setup
Performance Comparison
Ablation Study
...and 8 more sections

Key Result

Theorem 1

Given $w\not\in T$ and $t\in T$, and assuming that $h'$ and $H$ are $L_h$- and $L_H$-Lipschitz continuous, respectively, we have

Figures (5)

Figure 1: The flow chart of Multi-MaP. Multi-MaP obtains multiple clustering results based on the high-level concepts from users and the reference words from GPT-4.
Figure 2: Multi-MaP framework. In the training process of Multi-MaP, the vision and text encoders are frozen and the proxy word embeddings $\mathbf{w}_i$ are learnable. Specifically, it first constructs the prompt embeddings based on the reference words provided by GPT-4 using a user's high-level concept, and then selects a reference word $z_i$ for each image according to the similarity between the prompt embeddings $\mathbf{t}_i$ and the image embeddings $\mathbf{x}_i$. Then, it combines the prompt and the reference words to form the new prompt embeddings $\mathbf{t}_i^*$ and maximizes the similarity to the image representation, so the proxy word embeddings $\mathbf{w}_i$ can capture the desired image features. In addition, the proxy word embeddings $\mathbf{w}_i$ should be close to the target concept word $\mathbf{u_1}$ and the selected reference word $\mathbf{z}_i$ to construct the concept-level constraint and reference word constraint, which capture the features related to the user's interest.
Figure 3: Parameter analysis of $\alpha$ and $\beta$ on the Fruit dataset [hu2017finding].
Figure 4: Visualization of feature embeddings and related labels. The points represent the image or pseudo-word embeddings, and the triangles represent the prompt or label embeddings. Different colors represent different labels, which are indicated by the text next to the triangles.
Figure 5: Performance vs. running time on the Fruit dataset [hu2017finding].
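Two steps from the Figure 2 caption, selecting a reference word per image by similarity and pulling the proxy word toward the concept word and the selected reference word, can be sketched as follows. This is a simplified illustration, not the paper's objective: cosine similarity, squared Euclidean penalties, the weights `concept_weight` and `reference_weight`, and all toy vectors are assumptions, and the contrastive image-text term computed through the frozen encoders is omitted.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_reference(image_emb, prompt_embs):
    """Pick the reference word whose prompt embedding best matches the image."""
    return max(prompt_embs, key=lambda word: cosine(image_emb, prompt_embs[word]))


def constraint_penalty(proxy, concept_emb, reference_emb,
                       concept_weight=1.0, reference_weight=1.0):
    """Weighted squared distances from the proxy to the concept and reference words.

    The squared-distance form and both weights are hypothetical choices.
    """
    d_concept = sum((p - c) ** 2 for p, c in zip(proxy, concept_emb))
    d_reference = sum((p - r) ** 2 for p, r in zip(proxy, reference_emb))
    return concept_weight * d_concept + reference_weight * d_reference


# Toy 2-d embeddings (hypothetical; real embeddings would come from frozen encoders).
prompt_embs = {"red": [1.0, 0.0], "yellow": [0.0, 1.0]}
image_emb = [0.8, 0.6]
concept_emb = [0.5, 0.0]   # stands in for the user's concept word, e.g. "color"
proxy = [0.5, 0.5]         # the learnable proxy word embedding

chosen = select_reference(image_emb, prompt_embs)
penalty = constraint_penalty(proxy, concept_emb, prompt_embs[chosen])
print(chosen, penalty)
```

In a full training loop, a penalty of this kind would be added to a similarity-maximizing term and minimized with respect to the proxy embedding only, since the caption states that both encoders stay frozen.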

Theorems & Definitions (3)

Theorem 1
proof
Corollary 2
