Query-Based Knowledge Sharing for Open-Vocabulary Multi-Label Classification

Xuelin Zhu, Jian Liu, Dongqi Tang, Jiawei Ge, Weijia Liu, Bo Liu, Jiuxin Cao

TL;DR

Open-vocabulary multi-label classification faces challenges in unseen-label recognition and underutilization of cross-modal knowledge. The authors propose Query-based Knowledge Sharing (QKS), a framework that freezes a vision-language pre-training (VLP) backbone and introduces a knowledge extraction module with learnable, label-agnostic query tokens, plus a knowledge sharing module that aligns shared visual clues with label embeddings. A prompt pool enriches label embeddings, and ranking is reformulated as a classification objective so that vector magnitudes can contribute to matching. On NUS-WIDE and Open Images, QKS outperforms state-of-the-art methods in mAP and F1 on both zero-shot (ZSL) and generalized zero-shot (GZSL) tasks; on the zero-shot task, mAP improves by 5.9% and 4.5%, respectively.
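
The point about vector magnitudes can be illustrated with a small sketch. It assumes, hypothetically, that the ranking baseline scores labels by cosine similarity with a pairwise hinge loss, and that the classification form is a sigmoid binary cross-entropy over unnormalized dot products; the summary does not give the paper's exact losses, and all function names below are illustrative.

```python
import numpy as np

def cosine_scores(visual, labels):
    """Cosine similarity: both sides are L2-normalized, so magnitude is discarded."""
    v = visual / np.linalg.norm(visual)
    l = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    return l @ v

def dot_scores(visual, labels):
    """Raw dot product: the magnitude of the feature vector affects the score."""
    return labels @ visual

def ranking_loss(scores, targets, margin=1.0):
    """Pairwise hinge loss (assumed baseline): each positive should outscore
    each negative by `margin`. Expects at least one positive and one negative."""
    pos = scores[targets == 1]
    neg = scores[targets == 0]
    return float(np.mean(np.maximum(0.0, margin + neg[None, :] - pos[:, None])))

def bce_loss(logits, targets):
    """Numerically stable sigmoid binary cross-entropy on raw logits."""
    per_label = (np.maximum(logits, 0.0) - logits * targets
                 + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_label))

rng = np.random.default_rng(0)
visual = rng.normal(size=8)          # stand-in for an image-side feature vector
labels = rng.normal(size=(4, 8))     # stand-in for four label embeddings
targets = np.array([1.0, 0.0, 0.0, 1.0])
scaled = 3.0 * visual                # same direction, larger magnitude

# Cosine scores ignore the rescaling; dot-product logits grow with it.
print(cosine_scores(visual, labels) - cosine_scores(scaled, labels))
print(bce_loss(dot_scores(visual, labels), targets),
      bce_loss(dot_scores(scaled, labels), targets))
```

Under these assumptions, rescaling the visual feature leaves the cosine-based ranking unchanged but changes the classification loss, which is one way magnitude can carry confidence information.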

Abstract

Identifying labels that did not appear during training, known as multi-label zero-shot learning, is a non-trivial task in computer vision. To this end, recent studies have attempted to explore the multi-modal knowledge of vision-language pre-training (VLP) models by knowledge distillation, allowing unseen labels to be recognized in an open-vocabulary manner. However, experimental evidence shows that knowledge distillation is suboptimal and provides limited performance gains in unseen-label prediction. In this paper, a novel query-based knowledge sharing paradigm is proposed to explore the multi-modal knowledge of the pretrained VLP model for open-vocabulary multi-label classification. Specifically, a set of learnable label-agnostic query tokens is trained to extract critical vision knowledge from the input image; these tokens are then shared across all labels, allowing each label to select tokens of interest as visual clues for recognition. In addition, we propose an effective prompt pool for robust label embedding and reformulate standard ranking learning as classification, so that the magnitude of feature vectors can contribute to matching; both changes significantly benefit label recognition. Experimental results show that our framework significantly outperforms state-of-the-art methods on the zero-shot task, by 5.9% and 4.5% in mAP on NUS-WIDE and Open Images, respectively.
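
The extract-then-share idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes single-head cross-attention without learned projections for extraction, and a max over per-token dot products for the step where each label selects its token of interest. The random arrays stand in for frozen VLP encoder outputs, and the function names are hypothetical. The 12 query tokens follow the Figure 4 caption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extract_knowledge(queries, patches):
    """Label-agnostic query tokens attend over spatial features.
    queries: (K, D), patches: (N, D) -> knowledge tokens: (K, D)."""
    d = queries.shape[-1]
    attn = softmax(queries @ patches.T / np.sqrt(d), axis=-1)  # (K, N)
    return attn @ patches

def share_knowledge(tokens, label_emb):
    """Every label scores all shared tokens and keeps its best match.
    tokens: (K, D), label_emb: (C, D) -> logits: (C,), chosen token index: (C,)."""
    match = label_emb @ tokens.T          # (C, K) matching scores
    return match.max(axis=1), match.argmax(axis=1)

rng = np.random.default_rng(0)
K, N, C, D = 12, 49, 5, 16
queries = rng.normal(size=(K, D))    # learnable during training; random here
patches = rng.normal(size=(N, D))    # stand-in for frozen vision encoder output
label_emb = rng.normal(size=(C, D))  # stand-in for frozen text encoder output

tokens = extract_knowledge(queries, patches)
logits, chosen = share_knowledge(tokens, label_emb)
probs = 1.0 / (1.0 + np.exp(-logits))  # per-label probabilities
print(chosen, probs.round(3))
```

Because the query tokens do not depend on any label, an unseen label only needs a text embedding to be scored against the same tokens, which is what makes the sketch open-vocabulary in spirit.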

Paper Structure (24 sections, 7 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: A brief comparison of paradigms for exploring pretrained vision-language models in the open-vocabulary multi-label classification task. (a) MKT relies on knowledge distillation to preserve the image-text matching ability of the VLP model and performs ranking learning for label recognition. (b) Our QKS takes the VLP model as part of the framework and designs a vision knowledge extraction module that extracts crucial and informative vision features, which are matched with label embeddings through classification learning.
  • Figure 2: Detailed illustration of the proposed QKS framework. It takes a frozen VLP model as its foundation, followed by a knowledge extraction module and a knowledge sharing module. The former employs a set of label-agnostic query tokens to aggregate crucial and informative knowledge from the spatial features encoded by the VLP vision encoder, while the latter allows label embeddings encoded by the VLP language encoder to select tokens of interest as visual clues for recognition.
  • Figure 3: Effect of the hyper-parameters. The AVG (red curve) is the average score of mAP, Top-3 F1 and Top-5 F1.
  • Figure 4: Visualization of the distribution of 81 unseen labels' preferences for 12 query tokens on the NUS-WIDE testing set.
  • Figure 5: Distribution of the number of positive labels with the maximum matching score for each query token.
  • ...and 2 more figures