Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning
Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan
TL;DR
This work tackles zero-shot learning (ZSL) by addressing the misalignment between visual features and semantic attributes in standard backbones. It introduces ZSLViT, a Vision Transformer augmented with semantic-embedded token learning (SET) and visual enhancement. Guided by semantic prototypes, the network progressively discovers and retains semantic-related visual representations while discarding semantic-unrelated visual cues. The approach uses cross-space reconstruction losses $\mathcal{L}_{SR}$ and $\mathcal{L}_{VR}$ and a semantic projection $\phi(x) = Token[cls]^T W_{V2S}$, and it is optimized with a cross-entropy loss plus the SET objective. Empirically, ZSLViT achieves state-of-the-art results on CUB, SUN, and AWA2 under both the conventional (CZSL) and generalized (GZSL) settings, highlighting the effectiveness of semantic-guided ViT architectures for ZSL.
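The semantic projection and cross-entropy objective mentioned above can be illustrated with a minimal sketch in plain Python. Only the formula $\phi(x) = Token[cls]^T W_{V2S}$ and the use of a cross-entropy loss come from the summary. The dot-product scoring against per-class attribute vectors, the function names, and the toy numbers are assumptions made for illustration, not the authors' implementation.

```python
import math

def project_to_semantic(cls_token, w_v2s):
    """Compute phi(x) = cls_token^T W_V2S, where w_v2s has shape (D, A)."""
    num_attrs = len(w_v2s[0])
    return [
        sum(cls_token[d] * w_v2s[d][a] for d in range(len(cls_token)))
        for a in range(num_attrs)
    ]

def class_scores(phi, class_semantics):
    """Assumed scoring rule: dot product between phi(x) and each class's attribute vector."""
    return [sum(p * z for p, z in zip(phi, z_c)) for z_c in class_semantics]

def cross_entropy(scores, label):
    """Softmax cross-entropy over class scores, using the log-sum-exp trick for stability."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[label]

# Toy example (hypothetical values): a 3-dim [cls] token, 2 attributes, 2 classes.
cls_token = [1.0, 0.0, 2.0]
w_v2s = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
class_semantics = [[1.0, 0.0], [0.0, 1.0]]

phi = project_to_semantic(cls_token, w_v2s)
scores = class_scores(phi, class_semantics)
predicted = max(range(len(scores)), key=scores.__getitem__)
loss = cross_entropy(scores, label=0)
```

At test time the same scoring can be applied to attribute vectors of unseen classes, which is what allows a model trained only on seen classes to rank classes it has never observed.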
Abstract
Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT). Lacking the guidance of semantic information, these backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT considers two properties throughout the network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning, which improves the visual-semantic correspondences via semantic enhancement and explicitly discovers the semantic-related visual tokens with semantic-guided token attention. Then, for visual enhancement, we fuse the visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT .
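The visual enhancement step, fusing visual tokens with low semantic-visual correspondence, can be sketched as follows. The abstract does not specify the fusion rule, so this sketch assumes one plausible choice: keep the `keep` highest-scoring tokens and merge the rest into a single token by a score-weighted average. The function name `fuse_low_score_tokens`, the `keep` parameter, and the toy scores (standing in for the output of semantic-guided token attention) are all hypothetical.

```python
def fuse_low_score_tokens(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and merge the rest into one fused token.

    Assumption: fusion is a score-weighted average of the low-scoring tokens.
    """
    # Rank token indices by score, highest first (ties keep their original order).
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(order[:keep])  # preserve the original token order
    low_idx = order[keep:]

    kept = [tokens[i] for i in kept_idx]
    if not low_idx:
        return kept

    total = sum(scores[i] for i in low_idx)
    if total > 0:
        weights = [scores[i] / total for i in low_idx]
    else:
        # Fall back to a plain average when all low scores are zero.
        weights = [1.0 / len(low_idx)] * len(low_idx)

    dim = len(tokens[0])
    fused = [
        sum(w * tokens[i][d] for w, i in zip(weights, low_idx))
        for d in range(dim)
    ]
    return kept + [fused]

# Toy example (hypothetical values): four 2-dim visual tokens with attention scores.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.5, 0.1, 0.3, 0.1]
reduced = fuse_low_score_tokens(tokens, scores, keep=2)
```

Applying such a reduction in several encoder layers shrinks the token sequence step by step, which matches the progressive behavior the abstract describes, though the exact schedule and fusion rule here are illustrative only.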
