Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework

Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng Huang, Zheng Wang

TL;DR

This work addresses cross-modal few-shot learning by proposing Generative Transfer Learning (GTL), a two-stage framework that learns invariant latent concepts z_c while disentangling modality-specific disturbances z_m. The model uses a VAE-style generator and a disturbance encoder with latent domain gating to align multi-modal data, enabling knowledge transfer from abundant unimodal data to scarce multi-modal scenarios. By optimizing a representation ELBO and a classification objective, GTL jointly disentangles cross-modal structure and modality-specific information, achieving state-of-the-art results on seven multi-modal datasets including RGB-Sketch, RGB-Infrared, and RGB-Depth. The approach demonstrates strong generalization across diverse modalities with limited labeled samples, highlighting its potential for real-world multi-modal recognition and cross-domain transfer.
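The combined objective mentioned above can be illustrated with a minimal sketch. It is not the paper's exact formulation. It assumes diagonal-Gaussian posteriors over z_c and z_m with standard-normal priors, and a squared-error reconstruction term standing in for the negative log-likelihood up to scale and constants. The function names and the weights `beta` and `lam` are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def squared_error(x, x_hat):
    """Reconstruction term (assumed stand-in for the negative log-likelihood)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[label]

def gtl_style_loss(x, x_hat, mu_c, log_var_c, mu_m, log_var_m,
                   logits, label, beta=1.0, lam=1.0):
    """Negative ELBO over (z_c, z_m) plus a weighted classification loss.

    beta and lam are hypothetical weights, not values from the paper.
    """
    kl_total = (kl_to_standard_normal(mu_c, log_var_c)      # shared concept z_c
                + kl_to_standard_normal(mu_m, log_var_m))   # modality disturbance z_m
    neg_elbo = squared_error(x, x_hat) + beta * kl_total
    return neg_elbo + lam * cross_entropy(logits, label)
```

With a perfect reconstruction and posteriors equal to the prior, only the classification term remains, which makes the two parts of the objective easy to inspect separately.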

Abstract

Most existing studies on few-shot learning focus on unimodal settings, where models are trained to generalize to unseen data using a limited number of labeled examples from a single modality. However, real-world data are inherently multi-modal, and such unimodal approaches limit the practical applications of few-shot learning. To bridge this gap, this paper introduces the Cross-modal Few-Shot Learning (CFSL) task, which aims to recognize instances across multiple modalities while relying on scarce labeled data. Compared with classical few-shot learning, this task presents unique challenges arising from the distinct visual attributes and structural disparities inherent to each modality. To tackle these challenges, we propose a Generative Transfer Learning (GTL) framework that simulates how humans abstract and generalize concepts. Specifically, GTL jointly estimates the latent concept shared across modalities and the in-modality disturbance through a generative structure. Establishing the relationship between latent concepts and visual content on abundant unimodal data enables GTL to effectively transfer knowledge from unimodal to novel multi-modal data, as humans do. Comprehensive experiments demonstrate that GTL achieves state-of-the-art performance on seven multi-modal datasets spanning RGB-Sketch, RGB-Infrared, and RGB-Depth.

Paper Structure

This paper contains 39 sections, 14 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison of recognition tasks. (a) Classical recognition requires extensive labeled data within a single modality. (b) Few-shot recognition uses a few labeled samples in a single modality to classify unseen samples. (c) Our proposed CFSL involves few labeled multi-modal samples and aims to generalize to unseen multi-modal samples from the same classes, leveraging both seen and unseen data from different modalities.
  • Figure 2: Illustration of the ability to generalize concepts like “Bull” across visual modalities (Pablo Picasso. The Bull, 1945.).
  • Figure 3: The severe modality differences and the details of the proposed generative model. (a) Modality differences illustrated by t-SNE clustering of pre-trained CLIP (Radford et al., 2021) features from different modalities. (b) The proposed generative process for the representation learning stage; the green symbols are assumed to be parameters that enable the model to adapt from base to novel data.
  • Figure 4: The proposed GTL framework. During training on base data, all modules are trained (blue dashed box); when adapting to novel data, the generator is frozen and all other parts remain tunable (red dashed box). The recognition classifier is initialized separately for base and novel training because the two share no classes.
  • Figure 5: Test performance with different numbers of shots on the (a) and (b) Mask1k and (c) SKSF-A datasets.
  • ...and 4 more figures
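
The two-stage schedule described in the Figure 4 caption can be sketched as a small lookup of which modules receive gradient updates in each stage. This is only an illustration under stated assumptions: the module names below are hypothetical labels taken from the TL;DR wording, and the real framework may split its components differently.

```python
# Hypothetical module names; the paper's actual components may differ.
MODULES = ("generator", "disturbance_encoder", "domain_gating", "classifier")

def trainable_modules(stage):
    """Return the set of modules updated in the given training stage."""
    if stage == "base":
        # Base training: every module is trained.
        return set(MODULES)
    if stage == "novel":
        # Novel adaptation: the generator is frozen, the rest stays tunable.
        return set(MODULES) - {"generator"}
    raise ValueError(f"unknown stage: {stage!r}")

def reinitialize_classifier(stage):
    """The classifier is initialized separately per stage, since base and
    novel data share no classes."""
    return stage in ("base", "novel")
```

For example, `trainable_modules("novel")` excludes `"generator"` and keeps the other three names.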