Multimodal Generalized Category Discovery

Yuchang Su; Renping Zhou; Siyu Huang; Xingjian Li; Tianyang Wang; Ziyue Wang; Min Xu

TL;DR

This work extends Generalized Category Discovery (GCD) to the multimodal setting and introduces MM-GCD, a dual-branch framework that aligns both the feature and output spaces across modalities using multimodal contrastive learning and distillation-based losses. By building a cross-modal embedding space and a trainable prototypical classifier, MM-GCD achieves state-of-the-art results on UPMC-Food101 and N24News, surpassing previous methods by 11.5% and 4.7%, respectively. Theoretical analysis and ablations indicate that modality alignment reduces variance and improves novel category discovery, and that joint alignment with fusion-based predictions makes effective use of cross-modal information for open-world classification. Supplementary analyses and visualizations illustrate reduced prediction bias and more focused attention when cross-modal cues are used.

Abstract

Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
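
The two kinds of alignment named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the symmetric InfoNCE form of the feature-space loss, and the symmetric KL term standing in for the distillation-based output alignment are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def cross_modal_contrastive_loss(img_feats, txt_feats, temperature=0.1):
    """Feature-space alignment (assumed form): symmetric InfoNCE.

    Row i of img_feats and row i of txt_feats are treated as a positive
    pair; every other row in the batch acts as a negative.
    """
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    idx = np.arange(len(img))
    loss_i2t = -log_softmax(logits)[idx, idx].mean()    # image -> text
    loss_t2i = -log_softmax(logits.T)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def output_alignment_loss(img_logits, txt_logits):
    """Output-space alignment (assumed form): symmetric KL divergence
    between the class distributions predicted by the two branches."""
    log_p = log_softmax(img_logits)
    log_q = log_softmax(txt_logits)
    p, q = np.exp(log_p), np.exp(log_q)
    kl_pq = (p * (log_p - log_q)).sum(axis=1).mean()
    kl_qp = (q * (log_q - log_p)).sum(axis=1).mean()
    return 0.5 * (kl_pq + kl_qp)
```

In this sketch, the contrastive term is small when paired image and text embeddings point in the same direction, and the output term is zero when both branches predict identical class distributions. How the paper weights or schedules its actual losses is not stated in this summary.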

Paper Structure (19 sections, 19 equations, 8 figures, 7 tables)

Introduction
Related Works
Problem Formulation
Theoretical Analysis
Method
What is required to discover a new category?
Constructing Embedding Space
Partitioning Embedding Space
Results
Experiments Setup
Main Results
Ablation Study
Conclusion
Supplementary
Analysis of Prediction Bias
...and 4 more sections

Figures (8)

Figure 1: Evolution from NCD to Multimodal GCD. (a) Novel Category Discovery (NCD) dealt with unlabelled images containing only new classes. (b) Generalized Category Discovery (GCD) expanded this by allowing old classes in the unlabelled set, but was limited to single-modality data. (c) Our multimodal GCD model addresses these limitations by focusing on multimodal data, which is now abundant in real life, and by leveraging inter-modality interactions to improve learning where labels are missing.
Figure 2: Overview of our MM-GCD framework: We propose a dual-branch structure that processes text and image data separately, calculating unimodal loss to capture category distinctions. Our framework focuses on aligning the feature space through multimodal contrastive learning and optimizing the output space by entropy minimization for consistent decision-making across modalities.
Figure 3: The relationship between accuracy and feature similarity across visual and text modalities. The results show that groups with higher feature similarity tend to achieve greater accuracy.
Figure 4: Ablation study for different forms of alignment. FA: feature space alignment added using multimodal contrastive learning. OA: output space alignment added using cross-modal distillation.
Figure 5: t-SNE results on Food101. From left to right: (a) the baseline, using unimodal GCD loss with image, text, and fused features; (b) with multimodal contrastive learning added (feature alignment); (c) with multimodal prototype distillation added (output alignment). Dots and triangles represent text and image features, respectively.
...and 3 more figures
