Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Wei He; Xianghan Meng; Zhiyuan Huang; Xianbiao Qi; Rong Xiao; Chun-Guang Li

TL;DR

We propose SSR$^2$-GCD, a novel and effective multi-modal representation learning framework for GCD based on Semi-Supervised Rate Reduction. It learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships.

Abstract

Generalized Category Discovery (GCD) aims to identify both known and unknown categories when only partial labels are given for the known categories, which poses a challenging open-set recognition problem. State-of-the-art approaches to the GCD task are usually built on multi-modality representation learning, which depends heavily on inter-modality alignment. However, few of them impose a proper intra-modality alignment to produce a desired underlying structure in the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, which learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. Extensive experiments on generic and fine-grained benchmark datasets demonstrate the superior performance of our approach.
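
The abstract does not spell out the SSR$^2$ objective. For orientation only, the sketch below computes the standard, fully supervised rate reduction quantity known from the maximal coding rate reduction literature, on which a semi-supervised variant would plausibly build. It is not the paper's loss: the function names, the distortion parameter `eps`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    Z has shape (d, n): one d-dimensional embedding per column.
    """
    d, n = Z.shape
    mat = np.eye(d) + (d / (n * eps ** 2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(mat)  # numerically stable log-determinant
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Rate of the whole set minus the size-weighted rates of each group."""
    labels = np.asarray(labels)
    n = Z.shape[1]
    within = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]  # embeddings assigned to group c
        within += (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return coding_rate(Z, eps) - within

# Toy data (hypothetical): two groups lying on orthogonal directions in R^4.
e1 = np.array([1.0, 0.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0, 0.0])
Z = np.stack([e1] * 10 + [e2] * 10, axis=1)   # shape (4, 20)
true_labels = np.array([0] * 10 + [1] * 10)   # groups match the directions
mixed_labels = np.array([0, 1] * 10)          # groups mix both directions

delta_true = rate_reduction(Z, true_labels)
delta_mixed = rate_reduction(Z, mixed_labels)
print(delta_true > delta_mixed)
```

In this toy case the grouping that matches the two orthogonal directions yields a strictly larger rate reduction than the grouping that mixes them, which is the sense in which such an objective rewards compact within-group and diverse between-group structure.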

Paper Structure (25 sections, 14 equations, 13 figures, 16 tables)

Introduction
Preliminaries
Problem Notation
Multi-modal GCD Pipelines
Our Proposed Approach: SSR$^2$-GCD
Retrieval-based Text Aggregation
Semi-supervised Rate Reduction Modules for Representation Learning
Dual-Branch Clustering
Experiments
Performance on Benchmark Datasets
Evaluation on Representation Learning
Ablation Study
More Evaluations
Conclusion
Experimental Details
...and 10 more sections

Figures (13)

Figure 1: Illustration of Our Proposed Framework: SSR$^2$-GCD.
Figure 2: Distribution of pairwise similarities on the Flowers102 dataset (a): at the beginning of training, (b)-(c): training with $\mathcal{L}_\mathrm{SSR^2}$ at epochs 10 and 200, and (d)-(e): training with $\mathcal{L}_\text{CLIP}$ at epochs 10 and 200.
Figure 3: $R_e$ curves with different losses on (a)-(b): Flowers102 and (c)-(d): Stanford Cars datasets.
Figure 4: Effective ranks of image embeddings on (a)-(b): Oxford Pets and (c)-(d): Flowers102 datasets.
Figure 5: Visualizations of image embeddings via different representation learning methods.
...and 8 more figures
