Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

Shiming Chen, Wenjin Hou, Salman Khan, Fahad Shahbaz Khan

TL;DR

This work tackles zero-shot learning by addressing the misalignment between visual features and semantic attributes in standard backbones. It introduces ZSLViT, a Vision Transformer augmented with semantic-embedded token learning and visual enhancement to progressively discover and retain semantic-related visual representations while discarding unrelated visual cues, guided by semantic prototypes. The approach uses cross-space reconstruction losses $\mathcal{L}_{SR}$ and $\mathcal{L}_{VR}$, and a semantic projection $\phi(x) = Token[cls]^T W_{V2S}$, optimized via a cross-entropy loss plus a SET objective, yielding robust visual-semantic interactions. Empirically, ZSLViT achieves state-of-the-art results on CUB, SUN, and AWA2 under both CZSL and GZSL settings, highlighting the effectiveness of semantic-guided ViT architectures for ZSL applications.
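
The semantic projection and cross-entropy objective mentioned above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that class scores are dot products between $\phi(x)$ and per-class semantic prototypes (attribute vectors), and all function names and toy values are hypothetical.

```python
import math

def project_to_semantic(cls_token, w_v2s):
    """phi(x) = cls_token^T W_V2S: map a D-dim [cls] token to an A-dim attribute vector."""
    num_attrs = len(w_v2s[0])
    return [sum(cls_token[d] * w_v2s[d][a] for d in range(len(cls_token)))
            for a in range(num_attrs)]

def class_scores(phi, prototypes):
    """Assumed compatibility: dot product of phi(x) with each class prototype."""
    return [sum(p * z for p, z in zip(phi, proto)) for proto in prototypes]

def cross_entropy(scores, label):
    """Softmax cross-entropy over class scores (log-sum-exp computed stably)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[label]

# Toy values (hypothetical): D = 2 visual dims, A = 3 attributes, 2 classes.
cls_token = [2.0, 1.0]
w_v2s = [[1.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
prototypes = [[1.0, 0.0, 1.0],   # class 0 attribute vector
              [0.0, 1.0, 0.0]]   # class 1 attribute vector

phi = project_to_semantic(cls_token, w_v2s)
scores = class_scores(phi, prototypes)
predicted = max(range(len(scores)), key=lambda c: scores[c])
loss = cross_entropy(scores, label=0)
```

At test time, the same scoring would be applied with the prototypes of unseen classes, which is what lets the semantic space carry knowledge from seen to unseen classes.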

Abstract

Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions that transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT). Lacking the guidance of semantic information, these backbones fail to learn matched visual-semantic correspondences for representing semantic-related visual features, which results in undesirable visual-semantic interactions. To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT considers two properties throughout the network: i) discovering the semantic-related visual representations explicitly, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse the visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. Code is available at: https://github.com/shiming-chen/ZSLViT

Paper Structure (10 sections, 9 equations, 5 figures, 2 tables)

This paper contains 10 sections, 9 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Motivation illustration. (a) Existing ZSL methods simply use a pre-trained network backbone (i.e., CNN or ViT) to extract visual features. (b) Our ZSLViT progressively learns semantic-visual correspondences to represent semantic-related visual features throughout the network for advancing ZSL. (c) Visual feature visualization. (c1) The heat map of visual features learned by a CNN backbone (e.g., ResNet101 [He et al., 2016]) covers the whole object and the background, and thus fails to capture the semantic attributes. (c2) The attention map of visual features learned by the standard ViT [Dosovitskiy et al., 2020] localizes the semantic attributes incorrectly. (c3) The attention map learned by our ZSLViT discovers the semantic-related visual representations and discards the semantic-unrelated visual information according to semantic-visual correspondences.
  • Figure 2: A single ZSLViT encoder. The ZSLViT encoder places a semantic-embedded token learning (SET) module and a visual enhancement (ViE) module between the multi-head self-attention and feed-forward network layers. SET improves the visual-semantic correspondences via semantic enhancement and discovers the semantic-related visual tokens explicitly with semantic-guided token attention. ViE fuses the visual tokens with low visual-semantic correspondence to discard the semantic-unrelated visual information, enhancing the visual tokens. The ZSLViT encoder is integrated into various layers to progressively learn semantic-related visual representations, enabling effective visual-semantic interactions for ZSL.
  • Figure 3: Visualizations of the attention masks and maps of our ZSLViT in various layers. The masked regions represent the semantic-unrelated visual tokens with low visual-semantic correspondences, which are fused into a new token for subsequent learning. The highlighted attention maps are the semantic-related visual tokens with high visual-semantic correspondences, which are preserved and passed to the next layer. Results show that ZSLViT can accurately identify the semantic-related and semantic-unrelated visual tokens in images for visual enhancement.
  • Figure 4: t-SNE visualizations of visual features for (a) seen classes and (b) unseen classes, learned by the CNN backbone (e.g., ResNet101 [He et al., 2016]), the standard ViT [Touvron et al., 2021], and our ZSLViT. The 10 colors denote 10 different seen/unseen classes randomly selected from CUB. (Best viewed in color.)
  • Figure 5: The effects of (a) the embedding coefficient $\gamma$, (b) the fusing rate $\kappa$, (c) the loss weight $\lambda_{VR}$, and (d) the loss weight $\lambda_{SR}$. CUB is used as the example dataset.
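
The token fusion described in the captions of Figures 2 and 3, controlled by the fusing rate $\kappa$ from Figure 5, can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: it assumes each visual token already has a non-negative semantic-correspondence score (for example, an attention weight) and that the low-scoring tokens are merged by a score-weighted average. The function name and the averaging rule are hypothetical.

```python
def fuse_low_score_tokens(tokens, scores, fusing_rate):
    """Keep the highest-scoring tokens and merge the rest into one new token.

    tokens:      list of equal-length float lists (visual tokens, [cls] excluded)
    scores:      one non-negative correspondence score per token (assumed input)
    fusing_rate: fraction of tokens to fuse, in [0, 1)
    """
    n = len(tokens)
    num_fused = int(n * fusing_rate)
    if num_fused == 0:
        return [list(t) for t in tokens]

    # Rank token indices from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(order[:n - num_fused])   # preserve original token order
    fuse_idx = order[n - num_fused:]

    # Assumed fusion rule: score-weighted average (uniform if all scores are zero).
    total = sum(scores[i] for i in fuse_idx)
    if total > 0:
        weights = [scores[i] / total for i in fuse_idx]
    else:
        weights = [1.0 / num_fused] * num_fused
    dim = len(tokens[0])
    fused = [sum(w * tokens[i][d] for w, i in zip(weights, fuse_idx))
             for d in range(dim)]

    return [list(tokens[i]) for i in keep_idx] + [fused]

# Toy example: 4 tokens, half of them fused into a single token.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.3, 0.7]
result = fuse_low_score_tokens(tokens, scores, fusing_rate=0.5)
```

Here the two highest-scoring tokens are kept as they are, and the remaining two are replaced by one fused token, so the sequence passed to the next layer shrinks from four tokens to three.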