Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning

Huajie Jiang, Zhengxian Li, Xiaohan Yu, Yongli Hu, Baocai Yin, Jian Yang, Yuankai Qi

TL;DR

This work tackles generalized zero-shot learning by addressing visual-semantic misalignment through a prompt-based adaptation of a pre-trained Vision Transformer. It introduces the Visual and Semantic Prompt Collaboration Network (VSPCN), which embeds both visual and semantic prompts and employs weak fusion in shallow layers and strong fusion in deep layers to foster semantic-aware visual features. The model is optimized with a combination of base classification and alignment losses ($L_{BASE}$, $L_{CED}$, $L_{SKD}$) and uses an inference calibration to balance seen and unseen predictions, aided by semantic adapters for instance-specific semantic refinement. Experiments on CUB, SUN, and AWA2 show state-of-the-art performance in both conventional ZSL and generalized ZSL settings, demonstrating improved generalization and robust visual-semantic alignment with a parameter-efficient prompt-learning approach.
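
The summary above mentions an inference calibration that balances seen and unseen predictions but does not give its form. Below is a minimal sketch, assuming the common calibrated-stacking form in which a constant is subtracted from seen-class scores before the arg-max. The function name `calibrated_predict`, the parameter `gamma`, and the toy scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def calibrated_predict(
    scores: Sequence[float], seen_mask: Sequence[bool], gamma: float
) -> int:
    """Return the arg-max class index after penalizing seen-class scores.

    Assumption: calibration subtracts a constant `gamma` from every
    seen-class score; the paper's exact rule may differ.
    """
    if len(scores) != len(seen_mask):
        raise ValueError("scores and seen_mask must have the same length")
    adjusted = [s - gamma if seen else s for s, seen in zip(scores, seen_mask)]
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Toy compatibility scores for three classes; the last class is unseen.
scores = [0.62, 0.55, 0.58]
seen_mask = [True, True, False]

uncalibrated = calibrated_predict(scores, seen_mask, gamma=0.0)
calibrated = calibrated_predict(scores, seen_mask, gamma=0.1)
print(uncalibrated, calibrated)
```

With `gamma = 0.0` the seen class with the highest raw score wins; a positive `gamma` moves borderline decisions toward unseen classes, which is the bias this kind of calibration is meant to correct.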

Abstract

Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. It inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone on seen-class data to obtain semantic-related visual features, which may cause overfitting on seen classes with a limited number of training images. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers in the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance in both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.
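
The prompt-tuning idea in the abstract can be illustrated with a generic token layout. This is a sketch under stated assumptions, not the VSPCN architecture: it assumes one visual prompt token and one semantic prompt token placed next to the CLS token in the input sequence of a frozen Vision Transformer, and it leaves out the weak and strong fusion modules entirely. The helper names `init_token` and `build_sequence`, the embedding size, and the patch count are hypothetical.

```python
import random
from typing import List

Token = List[float]


def init_token(dim: int, rng: random.Random) -> Token:
    """Create one token embedding with small random values."""
    return [rng.gauss(0.0, 0.02) for _ in range(dim)]


def build_sequence(
    cls_token: Token, prompts: List[Token], patch_tokens: List[Token]
) -> List[Token]:
    """Lay out the transformer input as [CLS, prompts..., patches...]."""
    return [cls_token, *prompts, *patch_tokens]


rng = random.Random(0)
dim = 8          # toy embedding size (assumption)
num_patches = 4  # toy number of image patches (assumption)

# Stand-ins for the frozen backbone's CLS and patch embeddings.
cls_token = init_token(dim, rng)
patch_tokens = [init_token(dim, rng) for _ in range(num_patches)]

# The two prompt tokens are the only values this sketch treats as trainable.
visual_prompt = init_token(dim, rng)
semantic_prompt = init_token(dim, rng)

sequence = build_sequence(cls_token, [visual_prompt, semantic_prompt], patch_tokens)
trainable_values = len(visual_prompt) + len(semantic_prompt)
print(len(sequence), trainable_values)
```

The point of the layout is parameter efficiency: only the prompt values are updated while the backbone embeddings stay fixed, which is the motivation the abstract gives for avoiding full fine-tuning on limited seen-class data.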

Paper Structure

This paper contains 13 sections, 19 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Different paradigms of GZSL. (a) Visual-semantic alignment with fixed backbone. (b) Fine-tuning visual features. (c) Fine-tuning with transformer backbone. (d) Our visual and semantic prompts collaboration network (VSPCN). The VSPCN integrates visual information and semantic information into the intermediate layers of the vision transformer network for semantic-related visual feature learning.
  • Figure 2: The framework of visual semantic prompts collaboration network (VSPCN). VSPCN utilizes the collaboration of visual and semantic prompts to adapt the pre-trained ViT model to the GZSL task. The visual prompt learns to extract visual information from image tokens, and the semantic prompt incorporates semantic information from semantic attributes. Weak prompt fusion (including weak visual prompt fusion (WVPF) and weak semantic prompt fusion (WSPF)) (see section "Weak Prompts Collaboration") and strong prompt fusion (including strong visual prompt fusion (SVPF) and strong semantic prompt fusion (SSPF)) (see section "Strong Prompts Collaboration") mechanisms are designed to integrate visual and semantic information. Furthermore, the adapters are utilized to update the semantic features for instance-level adaptive semantic information extraction (see section "Model Optimization and Inference").
  • Figure 3: Illustration of our fusion modules. (a) weak visual prompt fusion, (b) strong visual prompt fusion, (c) weak semantic prompt fusion, and (d) strong semantic prompt fusion.
  • Figure 6: Feature visualization for seen classes and unseen classes on CUB by t-SNE. Different colors refer to different classes. We randomly select 10 classes and show the visualization results of different approaches.
  • Figure 7: Visualization of attention maps obtained by CLS token, visual prompt (VP), and semantic prompt (SP), respectively.
  • ...and 8 more figures