Visual and Semantic Prompt Collaboration for Generalized Zero-Shot Learning
Huajie Jiang, Zhengxian Li, Xiaohan Yu, Yongli Hu, Baocai Yin, Jian Yang, Yuankai Qi
TL;DR
This work tackles generalized zero-shot learning by addressing visual-semantic misalignment through a prompt-based adaptation of a pre-trained Vision Transformer. It introduces the Visual and Semantic Prompt Collaboration Network (VSPCN), which embeds both visual and semantic prompts and employs weak fusion in shallow layers and strong fusion in deep layers to foster semantic-aware visual features. The model is optimized with a combination of base classification and alignment losses ($L_{BASE}$, $L_{CED}$, $L_{SKD}$) and uses an inference calibration to balance seen and unseen predictions, aided by semantic adapters for instance-specific semantic refinement. Experiments on CUB, SUN, and AWA2 show state-of-the-art performance in both conventional ZSL and generalized ZSL settings, demonstrating improved generalization and robust visual-semantic alignment with a parameter-efficient prompt-learning approach.
Abstract
Generalized zero-shot learning aims to recognize both seen and unseen classes with the help of semantic information that is shared among different classes. This inevitably requires consistent visual-semantic alignment. Existing approaches fine-tune the visual backbone on seen-class data to obtain semantic-related visual features, which may cause overfitting to seen classes when only a limited number of training images is available. This paper proposes a novel visual and semantic prompt collaboration framework, which utilizes prompt tuning techniques for efficient feature adaptation. Specifically, we design a visual prompt to integrate the visual information for discriminative feature learning and a semantic prompt to integrate the semantic information for visual-semantic alignment. To achieve effective prompt information integration, we further design a weak prompt fusion mechanism for the shallow layers and a strong prompt fusion mechanism for the deep layers of the network. Through the collaboration of visual and semantic prompts, we can obtain discriminative semantic-related features for generalized zero-shot image recognition. Extensive experiments demonstrate that our framework consistently achieves favorable performance on both conventional zero-shot learning and generalized zero-shot learning benchmarks compared to other state-of-the-art methods.
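The TL;DR mentions an inference calibration that balances seen and unseen predictions. The exact form used by VSPCN is not given in this section, so the following is only a minimal sketch of one common approach in generalized zero-shot learning: subtracting a constant from the scores of seen classes before taking the arg max. The function name `calibrated_predict`, the parameter `gamma`, and the example scores are all hypothetical and chosen purely for illustration.

```python
def calibrated_predict(scores, seen_mask, gamma):
    """Return the index of the top-scoring class after penalizing seen classes.

    scores    -- per-class compatibility scores for one image
    seen_mask -- booleans, True where the class was seen during training
    gamma     -- hypothetical calibration constant (>= 0); an assumption,
                 not a value taken from the paper
    """
    # Lower every seen-class score by gamma; unseen-class scores are unchanged.
    adjusted = [s - gamma if seen else s for s, seen in zip(scores, seen_mask)]
    # Arg max over the adjusted scores.
    return max(range(len(adjusted)), key=adjusted.__getitem__)


# Illustrative scores: classes 0 and 1 are seen, class 2 is unseen.
scores = [0.62, 0.55, 0.58]
seen_mask = [True, True, False]

print(calibrated_predict(scores, seen_mask, gamma=0.0))  # no calibration
print(calibrated_predict(scores, seen_mask, gamma=0.1))  # seen classes penalized
```

With no calibration the seen class 0 wins. With the penalty applied, the unseen class 2 becomes the prediction, which illustrates how such a constant can counteract the bias toward seen classes.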
