Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li

TL;DR

We propose SSR$^2$-GCD, a novel and effective multi-modal representation learning framework for GCD based on Semi-Supervised Rate Reduction, which learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships.

Abstract

Generalized Category Discovery (GCD) aims to identify both known and unknown categories when only partial labels are given for the known categories, which poses a challenging open-set recognition problem. State-of-the-art approaches to the GCD task are usually built on multi-modal representation learning, which depends heavily on inter-modality alignment. However, few of them impose a proper intra-modality alignment to produce the desired underlying structure of the representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, which learns cross-modality representations with desired structural properties by emphasizing proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets, demonstrating the superior performance of our approach.

Paper Structure

This paper contains 25 sections, 14 equations, 13 figures, and 16 tables.

Figures (13)

  • Figure 1: Illustration of Our Proposed Framework: SSR$^2$-GCD.
  • Figure 2: Distribution of pairwise similarities on the Flowers102 dataset (a): at the beginning of training, (b)-(c): training with $\mathcal{L}_\mathrm{SSR^2}$ at epochs 10 and 200, and (d)-(e): training with $\mathcal{L}_\text{CLIP}$ at epochs 10 and 200.
  • Figure 3: $R_e$ curves with different losses on (a)-(b): Flowers102 and (c)-(d): Stanford Cars datasets.
  • Figure 4: Effective ranks of image embeddings on (a)-(b): Oxford Pets and (c)-(d): Flowers102 datasets.
  • Figure 5: Visualizations of image embeddings via different representation learning methods.
  • ...and 8 more figures