`Eyes of a Hawk and Ears of a Fox': Part Prototype Network for Generalized Zero-Shot Learning

Joshua Feinglass, Jayaraman J. Thiagarajan, Rushil Anirudh, T. S. Jayram, Yezhou Yang

TL;DR

The paper tackles Generalized Zero-Shot Learning by replacing global attribute vectors with region-level, attribute-aware representations derived from a pre-trained Vision-Language detector (VINVL). It introduces the Part Prototype Network (PPN), which constructs region-specific class representations using a semantic-class attribute tensor and region embeddings, and aggregates region compatibilities into class scores with a calibrated softmax. Regularizers encouraging unseen-relevant attributes and a cosine-based visual-semantic alignment, plus a multiplicative calibrated stacking post-processing, improve GZSL performance on CUB, SUN, and AWA2. The approach demonstrates that localized region proposals provide a practical, single-stage foundation for robust zero-shot classification and offers a scalable baseline for future improvements with VINVL-enhanced features and localization-centric architectures.

Abstract

Current approaches in Generalized Zero-Shot Learning (GZSL) are built upon base models which consider only a single class attribute vector representation over the entire image. This is an oversimplification of the process of novel category recognition, where different regions of the image may have properties from different seen classes and thus have different predominant attributes. With this in mind, we take a fundamentally different approach: a pre-trained Vision-Language detector (VINVL) sensitive to attribute information is employed to efficiently obtain region features. A learned function maps the region features to region-specific attribute attention used to construct class part prototypes. We conduct experiments on a popular GZSL benchmark consisting of the CUB, SUN, and AWA2 datasets where our proposed Part Prototype Network (PPN) achieves promising results when compared with other popular base models. Corresponding ablation studies and analysis show that our approach is highly practical and has a distinct advantage over global attribute attention when localized proposals are available.
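The region-level scoring described above can be illustrated with a small NumPy sketch. It is one possible reading of the abstract, not the authors' implementation. The tensor shapes, the softmax attention, the dot-product compatibility, the sum over regions, and the function name `ppn_scores` are all assumptions made for illustration. The names `alpha`, `W`, and `beta` follow the roles given in the Figure 2 caption (part prototypes, regional embedding, regional attention mapping).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ppn_scores(regions, class_attrs, alpha, W, beta):
    """Hypothetical region-to-class scoring (all shapes are assumptions).

    regions:     (R, D) region features from a pre-trained detector
    class_attrs: (C, A) class-attribute strengths
    alpha:       (A, K) per-attribute embeddings used to build part prototypes
    W:           (D, K) regional embedding
    beta:        (D, A) mapping from region features to attribute attention
    """
    # Region-specific attention over attributes, shape (R, A).
    attn = softmax(regions @ beta, axis=-1)
    # Region-specific class prototypes, shape (R, C, K):
    # prototypes[r, c] = sum_a attn[r, a] * class_attrs[c, a] * alpha[a]
    prototypes = np.einsum("ra,ca,ak->rck", attn, class_attrs, alpha)
    # Embed each region into the prototype space, shape (R, K).
    emb = regions @ W
    # Assumed dot-product compatibility of each region with each class, (R, C).
    compat = np.einsum("rk,rck->rc", emb, prototypes)
    # Aggregate over regions (assumed: sum), then normalize over classes.
    return softmax(compat.sum(axis=0))

# Toy sizes chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
R, D, A, C, K = 5, 8, 6, 4, 3
scores = ppn_scores(
    rng.normal(size=(R, D)),
    rng.uniform(size=(C, A)),
    rng.normal(size=(A, K)),
    rng.normal(size=(D, K)),
    rng.normal(size=(D, A)),
)
print(scores.shape, scores.sum())
```

The point of the sketch is that each region receives its own attribute attention, so each class has a different prototype per region, in contrast to a single global attribute vector per image.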

Paper Structure (15 sections, 11 equations, 3 figures, 1 table)

This paper contains 15 sections, 11 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: A comparison between the proposed Part Prototype Network (PPN) approach and existing approaches that use global attribute attention, such as the base model DAZLE [dazle2020].
  • Figure 2: A visualization of the proposed Part Prototype Network (PPN) methodology. $\alpha$, $W$, and $\beta$ are learned parameters corresponding to the part prototypes, the regional embedding, and the mapping function for regional attention, respectively.
  • Figure 3: An ablation study of the GZSL harmonic mean performance of DAZLE (with VINVL features) and RAJE when using addition and multiplication for calibrated stacking. Within each dataset, the multiplicative and additive plots share the same vertical axis. The proposed multiplicative calibration performs better over a larger portion of the graph, whereas the previous additive approach has lower performance and dips sharply after its peak. Additive calibration can also dip sharply as its calibration value approaches 1, since it begins classifying all examples as unseen.
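
The additive and multiplicative calibration contrasted in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the additive form subtracts a constant `gamma` from seen-class scores, and the multiplicative form is assumed here to scale seen-class scores by `(1 - gamma)`. The function names and toy probabilities are hypothetical.

```python
import numpy as np

def additive_calibration(probs, seen_mask, gamma):
    # Subtract gamma from the scores of seen classes only.
    return probs - gamma * seen_mask

def multiplicative_calibration(probs, seen_mask, gamma):
    # Assumed form: scale seen-class scores by (1 - gamma).
    return probs * np.where(seen_mask, 1.0 - gamma, 1.0)

def harmonic_mean(acc_seen, acc_unseen):
    # Standard GZSL summary metric over seen and unseen accuracy.
    total = acc_seen + acc_unseen
    return 0.0 if total == 0 else 2.0 * acc_seen * acc_unseen / total

# Toy softmax output: two seen classes followed by one unseen class.
probs = np.array([0.70, 0.20, 0.10])
seen_mask = np.array([True, True, False])

for gamma in (0.0, 0.65, 0.9):
    add_pred = int(np.argmax(additive_calibration(probs, seen_mask, gamma)))
    mul_pred = int(np.argmax(multiplicative_calibration(probs, seen_mask, gamma)))
    print(f"gamma={gamma}: additive -> class {add_pred}, multiplicative -> class {mul_pred}")
```

Under these assumptions, the additive prediction flips to an unseen class once the absolute gap between the best seen score and the best unseen score falls below `gamma`, while the multiplicative prediction flips only when the ratio of the two scores falls below `1 / (1 - gamma)`. With probabilities bounded by 1, an additive `gamma` near 1 pushes every seen score to zero or below, which matches the caption's remark about classifying all examples as unseen.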