Visual-Semantic Graph Matching Net for Zero-Shot Learning

Bowen Duan, Shiming Chen, Yufei Guo, Guo-Sen Xie, Weiping Ding, Yisong Wang

TL;DR

A visual-semantic graph matching net (VSGMN), which leverages semantic relationships among classes to aid in visual-semantic embedding and achieves superior performance in both conventional and generalized ZSL scenarios.

Abstract

Zero-shot learning (ZSL) aims to leverage additional semantic information to recognize unseen classes. To transfer knowledge from seen to unseen classes, most ZSL methods learn a shared embedding space by simply aligning visual embeddings with semantic prototypes. However, methods trained under this paradigm often struggle to learn a robust embedding space because they align the two modalities class by class in isolation, ignoring the crucial class relationships during the alignment process. To address this challenge, this paper proposes a Visual-Semantic Graph Matching Net, termed VSGMN, which leverages semantic relationships among classes to aid visual-semantic embedding. VSGMN employs a Graph Build Network (GBN) and a Graph Matching Network (GMN) to achieve two-stage visual-semantic alignment. Specifically, GBN first uses an embedding-based approach to build visual and semantic graphs in the semantic space and aligns each embedding with its prototype for the first-stage alignment. Additionally, to supplement unseen class relations in these graphs, GBN also builds the unseen class nodes based on semantic relationships. In the second stage, GMN continuously integrates neighbor and cross-graph information into the constructed graph nodes, and aligns the node relationships between the two graphs under the class relationship constraint. Extensive experiments on three benchmark datasets demonstrate that VSGMN achieves superior performance in both conventional and generalized ZSL scenarios. The implementation of VSGMN and the experimental results are available on GitHub: https://github.com/dbwfd/VSGMN
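To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes cosine similarity for both the embedding-to-prototype alignment and the class graph edges, and it assumes a mean squared difference as the relationship-matching penalty. The function names, the toy attribute vectors, and the fixed offsets are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def class_relation_graph(nodes):
    """Dense class graph: edge weight = cosine similarity between class nodes (assumed)."""
    return cosine_similarity_matrix(nodes, nodes)

def relation_alignment_gap(visual_nodes, semantic_prototypes):
    """Mean squared difference between the edge weights of the two graphs (assumed penalty)."""
    visual_graph = class_relation_graph(visual_nodes)
    semantic_graph = class_relation_graph(semantic_prototypes)
    return float(np.mean((visual_graph - semantic_graph) ** 2))

def predict(visual_embeddings, semantic_prototypes):
    """Per-class alignment only: assign each embedding to its nearest prototype."""
    scores = cosine_similarity_matrix(visual_embeddings, semantic_prototypes)
    return np.argmax(scores, axis=1)

# Toy semantic prototypes for three classes (hypothetical 4-dimensional attributes).
prototypes = np.array([
    [1.0, 0.9, 0.0, 0.1],  # cat
    [0.9, 1.0, 0.1, 0.0],  # lion
    [0.0, 0.1, 1.0, 0.9],  # bird
])

# Toy visual class nodes: prototypes shifted by small fixed offsets.
offsets = 0.02 * np.array([
    [1.0, -1.0, 1.0, -1.0],
    [-1.0, 1.0, 1.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
])
visual_nodes = prototypes + offsets

semantic_graph = class_relation_graph(prototypes)
labels = predict(visual_nodes, prototypes)
gap = relation_alignment_gap(visual_nodes, prototypes)

print("cat-lion edge:", semantic_graph[0, 1])
print("cat-bird edge:", semantic_graph[0, 2])
print("predicted classes:", labels)
print("relation alignment gap:", gap)
```

In this toy setup the cat-lion edge is much stronger than the cat-bird edge, which is the kind of inter-class structure that alignment performed one class at a time does not constrain. A relationship-matching term such as `relation_alignment_gap` penalizes visual class nodes whose pairwise similarities drift away from those of the semantic prototypes.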

Paper Structure

This paper contains 38 sections, 16 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation illustration. (a) Most existing embedding-based methods treat semantic vectors solely as classifiers, neglecting the crucial inter-class information (e.g., the relationship between a cat and a lion is much closer than the relationship between a cat and a bird) inherent in semantic vectors. (b) Existing methods based on category relationships attempt to explore the category relationship information in semantic vectors, but often confine its use to the semantic space. (c) Our VSGMN not only utilizes the category relationship information in semantic vectors but also transfers this information from the semantic space to the visual space. This enables us to impose class-level relationship constraints on the augmentation of visual features and on the space mappings.
  • Figure 2: The architecture of the proposed VSGMN model. VSGMN consists of a GBN and a GMN. GBN aims to build the visual graph and the semantic graph in the semantic space. To ensure that node representations have the same dimensions in both graphs, and to achieve the first-stage alignment between vision and semantics, we use a visual-semantic embedding network (e.g., TransZero [chen2022transzero]) to bridge the two spaces. Additionally, at this stage we propose a method that builds virtual features for unseen classes to utilize the relationships among unseen classes, and masks the generated virtual embeddings to prevent interference from noise. GMN constrains the training process of GBN by matching the category relationships between visual embeddings and semantic prototypes.
  • Figure 3: The first-stage visual-semantic alignment, the second-stage visual-semantic alignment and the optimization effect achieved by simultaneously using both constraint methods. The shaded area represents the region where samples of this category may appear, while the dashed line indicates the inter-class relationships that need to be aligned with each other between the two graphs.
  • Figure 4: t-SNE visualization [van2008visualizing] of visual features in the semantic space and the visual space learned by our VSGMN and the baseline for the same seen and unseen classes. Different colors denote different classes. We conduct experiments on 20 classes of CUB. The baseline model, while capable of finding reasonably appropriate semantic and visual spaces for samples, exhibits a certain degree of category confusion, which is particularly evident in unseen classes. VSGMN notably alleviates this confusion, especially for unseen classes.
  • Figure 5: The effects of different architectures for the GMN on (a) AWA2 and (b) CUB. We investigate the number of graph match layers for the propagation-based implementation and the attention-based implementation.
  • ...and 1 more figure