PSVMA+: Exploring Multi-granularity Semantic-visual Adaption for Generalized Zero-shot Learning

Man Liu, Huihui Bai, Feng Li, Chunjie Zhang, Yunchao Wei, Meng Wang, Tat-Seng Chua, Yao Zhao

TL;DR

PSVMA+ is a multi-granularity progressive semantic-visual mutual adaption network that gathers visual elements across granularity levels to remedy granularity inconsistency. In the reported experiments, it consistently outperforms state-of-the-art methods.

Abstract

Generalized zero-shot learning (GZSL) aims to recognize unseen categories using knowledge from the seen domain, which requires intrinsic interactions between visual features and attribute semantic features. However, GZSL suffers from insufficient visual-semantic correspondences due to attribute diversity and instance diversity. Attribute diversity refers to the varying semantic granularity of attribute descriptions, ranging from low-level (specific, directly observable) to high-level (abstract, highly generic) characteristics. This diversity makes it difficult to collect adequate visual cues for attributes at a single granularity. Additionally, diverse visual instances that correspond to the same shared attributes introduce semantic ambiguity, leading to vague visual patterns. To tackle these problems, we propose a multi-granularity progressive semantic-visual mutual adaption (PSVMA+) network, in which sufficient visual elements are gathered across granularity levels to remedy the granularity inconsistency. PSVMA+ explores semantic-visual interactions at different granularity levels, enabling awareness of multi-granularity in both visual and semantic elements. At each granularity level, the dual semantic-visual transformer module (DSVTM) recasts the shared attributes into instance-centric attributes and aggregates the semantic-related visual regions, thereby learning unambiguous visual features that accommodate various instances. Because different granularities contribute differently, PSVMA+ employs selective cross-granularity learning to leverage knowledge from reliable granularities and adaptively fuses multi-granularity features into comprehensive representations. Experimental results demonstrate that PSVMA+ consistently outperforms state-of-the-art methods.
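
To make the final step of the abstract concrete, the following is a minimal toy sketch of attribute-based classification with a weighted fusion over granularities. It is not the authors' implementation. The softmax gating, the dot-product compatibility score, the function names, and all numbers are assumptions chosen for illustration; the abstract does not specify the form of the fusion or of the classifier.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fuse_granularities(attr_preds, gate_logits):
    """Weighted sum of per-granularity attribute predictions.

    Assumption: fusion weights come from a softmax over one gate logit
    per granularity (a hypothetical stand-in for adaptive fusion).
    """
    weights = softmax(gate_logits)
    n_attrs = len(attr_preds[0])
    return [
        sum(w * pred[i] for w, pred in zip(weights, attr_preds))
        for i in range(n_attrs)
    ]


def class_scores(attr_pred, prototypes):
    """Compatibility of a predicted attribute vector with each class prototype."""
    return [dot(attr_pred, p) for p in prototypes]


# Toy setup: 3 attributes, 2 classes (class 0 seen, class 1 unseen).
prototypes = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

# Hypothetical attribute predictions from three granularity levels.
attr_preds = [[0.2, 0.9, 0.8], [0.1, 0.7, 0.9], [0.6, 0.5, 0.4]]

# Hypothetical gate logits: the third granularity is treated as less reliable.
gate_logits = [1.0, 1.0, -1.0]

fused = fuse_granularities(attr_preds, gate_logits)
scores = class_scores(fused, prototypes)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In this toy case the third granularity receives a small weight, so the fused attribute vector stays close to the first two predictions and the unseen class prototype scores highest. The unseen class can be scored at all only because it is described by an attribute prototype, which is the general premise of attribute-based GZSL.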

Paper Structure

This paper contains 25 sections, 21 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: The embedding-based models and visualization for attribute disambiguation for GZSL. (a) Part-based methods via attention mechanisms. (b) Semantic-guided methods. (c) PSVMA. (d) PSVMA+. $\mathcal{A}$, $S$, $F$ denote the category attribute prototypes, shared attributes, and visual features, respectively. PSVMA+ achieves the best performance on (e) inter-class disambiguation.
  • Figure 2: The framework of our PSVMA+ model. PSVMA+ explores the semantic-visual interaction under a hierarchical multi-granularity architecture to model sufficient correspondence, mitigating the inconsistency caused by diverse attributes and instances. DSVTM conducts semantic-visual mutual adaptation under each granularity, yielding unambiguous granularity-specific visual features. Then, SCGL actively selects the reliable granularity to guide the refinement of the unreliable one, which also encourages the emphasis on the challenging samples positioned near the decision boundaries. Finally, AMGF augments the category decision-making process by dynamically fusing the multi-granularity features.
  • Figure 3: DSVTM is a transformer-based structure that contains the IMSE and SMID, pursuing semantic and visual mutual adaption for the alleviation of semantic ambiguity. The IMSE in DSVTM progressively learns the instance-centric semantics to acquire a matched semantic-visual pair. The SMID in DSVTM constructs accurate interactions and learns unambiguous visual representations.
  • Figure 4: t-SNE visualizations of visual features for seen classes and unseen classes, learned by the (a) ViT backbone, (b) uni-granularity, (c) bi-granularities, (d) PSVMA+ w/o SCGL, (e) PSVMA+ w/o AMGF, and (f) our full PSVMA+. The 10 colors denote 10 different seen/unseen classes randomly selected from CUB.
  • Figure 5: Effect of granularity number on (a) CUB, (b) SUN, and (c) AwA2 datasets.
  • ...and 3 more figures