Modeling Multi-modal Cross-interaction for Multi-label Few-shot Image Classification Based on Local Feature Selection
Kun Yan, Zied Bouraoui, Fangyun Wei, Chang Xu, Ping Wang, Shoaib Jameel, Steven Schockaert
TL;DR
This work tackles multi-label few-shot image classification (ML-FSIC) by seeding label prototypes with word embeddings and refining them with a Loss Change Measurement (LCM) strategy that selects representative local image features. A multi-modal cross-interaction module fuses word-embedding priors with visual features through channel-wise cross-attention and word-embedding-based dynamic convolution, producing label prototypes without requiring fine-tuning for unseen labels. The approach includes a cross-modality alignment loss and a query-based objective. It is evaluated on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist, where it is reported to outperform state-of-the-art baselines, including in 5-shot and zero-shot scenarios. The results indicate improved prototype quality and effective handling of localized objects across diverse datasets and backbone choices.
Abstract
The aim of multi-label few-shot image classification (ML-FSIC) is to assign semantic labels to images in settings where only a small number of training examples are available for each label. A key feature of the multi-label setting is that an image often has several labels, which typically refer to objects appearing in different regions of the image. When estimating label prototypes in a metric-based setting, it is thus important to determine which regions are relevant for which labels, but the limited amount of training data and the noisy nature of local features make this highly challenging. As a solution, we propose a strategy in which label prototypes are gradually refined. First, we initialize the prototypes using word embeddings, which allows us to leverage prior knowledge about the meaning of the labels. Second, taking advantage of these initial prototypes, we use a Loss Change Measurement (LCM) strategy to select the local features from the training images (i.e., the support set) that are most likely to be representative of a given label. Third, we construct the final prototype of the label by aggregating these representative local features using a multi-modal cross-interaction mechanism, which again relies on the initial word-embedding-based prototypes. Experiments on COCO, PASCAL VOC, NUS-WIDE, and iMaterialist show that our model substantially improves on the current state of the art.
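
To make the three-step refinement concrete, the following is a minimal sketch of the overall data flow only. It is not the method itself: the abstract does not specify how LCM or the cross-interaction module work, so the sketch swaps in cosine-similarity ranking for the selection step and a softmax-weighted mean for the aggregation step. All function names, the `k` and `temperature` parameters, and the assumption that the word embedding already lives in the same space as the visual features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_local_features(init_proto, local_feats, k):
    """Step 2 stand-in: keep the k local features closest to the initial prototype.

    The paper uses Loss Change Measurement here; cosine ranking is a substitute.
    """
    ranked = sorted(local_feats, key=lambda f: cosine(init_proto, f), reverse=True)
    return ranked[:k]


def aggregate(init_proto, selected, temperature=0.1):
    """Step 3 stand-in: softmax-weighted mean of the selected features.

    The paper uses a multi-modal cross-interaction module; this only mimics the
    idea that the initial prototype guides how features are combined.
    """
    scores = [cosine(init_proto, f) / temperature for f in selected]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(init_proto)
    return [sum(w * f[d] for w, f in zip(weights, selected)) for d in range(dim)]


def build_prototype(word_vec, support_local_feats, k=2):
    """Run the three steps for one label."""
    # Step 1: initialize the prototype from the label's word embedding
    # (assumed to be already projected into the visual feature space).
    init_proto = list(word_vec)
    selected = select_local_features(init_proto, support_local_feats, k)
    return aggregate(init_proto, selected)


# Toy example: 2-D vectors standing in for a word embedding and four local features.
word_vec = [1.0, 0.0]
local_feats = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]
proto = build_prototype(word_vec, local_feats, k=2)
```

In this toy example the two features pointing away from the word vector are discarded, and the resulting prototype is a weighted mean of the two remaining features, leaning toward the one closest to the initial prototype.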
