Modeling Multi-modal Cross-interaction for Multi-label Few-shot Image Classification Based on Local Feature Selection

Kun Yan, Zied Bouraoui, Fangyun Wei, Chang Xu, Ping Wang, Shoaib Jameel, Steven Schockaert

TL;DR

This work tackles multi-label few-shot image classification by leveraging word embeddings to seed label prototypes and refining them with a Loss Change Measurement (LCM) that selects representative local image features. A novel multi-modal cross-interaction module fuses word embedding priors with visual features through channel-wise cross-attention and word-embedding-based dynamic convolution, producing robust label prototypes without requiring fine-tuning for unseen labels. The approach includes a cross-modality alignment loss and a query-based objective, and is evaluated on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist with strong gains over state-of-the-art baselines, including in 5-shot and zero-shot scenarios. The results demonstrate improved prototype quality, effective handling of localized objects, and practical potential for scalable ML-FSIC across diverse datasets and backbone choices.

Abstract

The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images, in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that an image often has several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes, in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we then use a Loss Change Measurement (LCM) strategy to select the local features from the training images (i.e. the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves the current state-of-the-art.
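The three refinement steps described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes the word embedding has already been projected into the same space as the local visual features, it uses a hypothetical cosine-based loss with leave-one-out scoring as a stand-in for the Loss Change Measurement, and it replaces the multi-modal cross-interaction module with plain softmax attention guided by the word-based prototype. All function names, the `keep` and `temperature` parameters, and the toy vectors are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def loss(prototype, features):
    # Hypothetical loss (an assumption, not the paper's objective):
    # 1 - cosine similarity between the prototype and the mean feature.
    return 1.0 - cosine(prototype, mean(features))


def select_by_loss_change(prototype, local_features, keep):
    # Stand-in for LCM: score each local feature by how much the loss
    # rises when that feature is left out, then keep the top `keep`.
    base = loss(prototype, local_features)
    changes = []
    for i in range(len(local_features)):
        rest = local_features[:i] + local_features[i + 1:]
        changes.append(loss(prototype, rest) - base)
    order = sorted(range(len(local_features)),
                   key=lambda i: changes[i], reverse=True)
    return sorted(order[:keep])


def aggregate(prototype, features, temperature=0.1):
    # Softmax attention over similarity to the word-based prototype,
    # used here in place of the cross-interaction module.
    scores = [cosine(prototype, f) / temperature for f in features]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * f[d] for w, f in zip(weights, features))
            for d in range(len(prototype))]


# Step 1: initial prototype from a (toy, already projected) word embedding.
word_prototype = [1.0, 0.0]
# Local features from support images; the last two mimic background regions.
local_features = [[0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]

# Step 2: select the local features most representative of the label.
selected = select_by_loss_change(word_prototype, local_features, keep=2)
# Step 3: aggregate the selected features into the final prototype.
final = aggregate(word_prototype, [local_features[i] for i in selected])
```

In this toy example the two features aligned with the word-based prototype are retained and the background-like features are discarded, so the final prototype stays close to the label direction. The leave-one-out scoring is only one plausible reading of "loss change"; the paper's own formulation should be consulted for the actual criterion.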

Paper Structure

This paper contains 43 sections, 18 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: In the 1-shot single-label setting, a given training image can be interpreted as a prototype for the considered label (left). In the multi-label setting, labels are related to different regions of the image, and these regions need to be identified before meaningful prototypes can be obtained (right).
  • Figure 2: Overview of our methodology. A joint embedding space is learned in which both labels and images are represented. A customized multi-modal cross-interaction strategy is proposed to calculate label prototypes using local features from the relevant images in the support set, along with word vectors that provide prior knowledge about the considered label. In our visual representation, the green solid line indicates the local feature flow for the base model, while the green dotted line represents the flow for the LCM model.
  • Figure 3: Ablation study for the number of attention heads and threshold $\theta$.
  • Figure 4: Visualization of attention weights among local features used to construct label prototypes for large objects. Darker areas indicate higher importance of the corresponding local features.
  • Figure 5: Visualization of the attention weights among local features for constructing label prototypes, using GloVe and mirrorBERT word vectors, across various categories. The color intensity indicates the relative importance of the corresponding local feature.
  • ...and 1 more figures