Visual Space Optimization for Zero-shot Learning

Xinsheng Wang, Shanmin Pang, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Yaochen Li

TL;DR

This work tackles zero-shot learning (ZSL) by addressing suboptimal visual-space geometry that hampers cross-modal embedding. It introduces two core ideas: (i) learnable visual prototypes $z_i$ that represent each class in the visual space, with semantic vectors embedded toward these prototypes, and (ii) a visual data structure optimization framework that learns an intermediate embedding space while actively shaping the visual feature topology via a structure-preserving loss and a ranking-based embedding objective. The proposed methods, namely the visual prototype based method (VPB) and visual-space structure optimization with a simple (SRS) or bi-directional (BRS) ranking loss, achieve state-of-the-art results on four benchmarks, with VPB delivering the strongest generalized zero-shot learning (GZSL) performance. The approach mitigates hubness and overfitting to seen classes by tightening class-discriminative prototypes and enforcing neighborhood structure, which supports more robust recognition of unseen categories. Overall, the work aligns semantic descriptions with a discriminative, well-structured visual space to improve generalization.

Abstract

Zero-shot learning, which aims to recognize new categories that are not included in the training set, has gained popularity owing to its potential in real-world applications. Zero-shot learning models rely on learning an embedding space in which both semantic descriptions of classes and visual features of instances can be embedded for nearest neighbor search. Most recent works consider the visual space formed by deep visual features to be an ideal choice of embedding space. However, the discrete distribution of instances in the visual space leaves the data structure insufficiently distinctive. We argue that optimizing the visual space is crucial, as it allows semantic vectors to be embedded into the visual space more effectively. In this work, we propose two strategies to accomplish this. The first is a visual prototype based method, which learns a visual prototype for each visual class, so that a class can be represented in the visual space by a single prototype feature instead of a series of discrete visual features. The second optimizes the visual feature structure in an intermediate embedding space; for this method we devise a multilayer perceptron based algorithm that learns the common intermediate embedding space while making the visual data structure more distinctive. Through extensive experimental evaluation on four benchmark datasets, we demonstrate that optimizing the visual space is beneficial for zero-shot learning. In addition, the proposed prototype based method achieves new state-of-the-art performance.
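The nearest neighbor search mentioned in the abstract can be illustrated with a minimal sketch. It assumes the semantic vector of each class has already been embedded into the visual space (that embedding step is not shown), and it uses Euclidean distance; the function names, the distance choice, and the toy 2-D numbers are illustrative assumptions, not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_class(visual_feature, class_embeddings):
    """Return the class whose embedded semantic vector is nearest to the feature.

    class_embeddings maps a class name to its semantic vector after it has
    been embedded into the visual space (assumed to be given here).
    """
    return min(
        class_embeddings,
        key=lambda name: euclidean(visual_feature, class_embeddings[name]),
    )


# Toy 2-D embeddings with made-up values (hypothetical, for illustration only).
embedded = {"cat": [0.0, 0.0], "tiger": [4.0, 0.0], "leopard": [4.0, 3.0]}

print(predict_class([3.5, 2.0], embedded))  # leopard
```

In this picture, the quality of the prediction depends entirely on where the class embeddings land in the visual space, which is why the geometry of that space matters.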

Paper Structure

This paper contains 31 sections, 21 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Illustration of the visual feature distribution of three different categories, i.e. Cat, Tiger and Leopard. In some cases, the inter-class variation is even smaller than the intra-class variation. Even the class centroids are not discriminative enough, as they may be closer to instances from other classes than to some instances from the same class.
  • Figure 2: Illustration of the proposed method. (a) Visual prototype based method. The prototypes are learned via backpropagation. With the learned visual prototypes, the semantic representation of each class is embedded to the corresponding visual prototype rather than to numerous instance features. (b) Visual feature structure optimization based method. Both semantic representations and visual features are embedded into an intermediate space. The embedding space has the same dimensionality as the visual space.
  • Figure 3: Proposed network architectures for visual feature structure optimization based methods. (a) The architecture with the simple ranking loss and the structure optimizing loss (SRS). (b) The architecture with the bi-directional ranking loss and structure optimizing loss (BRS).
  • Figure 4: Effectiveness of the visual structure optimizing function on GZSL. SR and SRS denote the simple ranking loss without and with visual structure optimizing loss, respectively. BR and BRS indicate the bi-directional ranking loss without and with visual structure optimizing loss, respectively.
  • Figure 5: Comparison of visual centroid based and learned prototype based performance on GZSL. VCB refers to the visual centroid based method, which shares the same framework as VPB but uses visual centroids instead of learned visual prototypes.
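The centroid issue described in the Figure 1 caption, and the visual centroids used by the VCB baseline in Figure 5, can be made concrete with a small sketch. It computes a class centroid as the mean of that class's features and checks whether any instance of another class lies closer to the centroid than the farthest instance of the same class. The helper names and the toy 2-D features are hypothetical and are not taken from the paper.

```python
import math


def centroid(features):
    """Mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def dist(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid_is_ambiguous(own, others):
    """True if some other-class instance is closer to the class centroid
    than the farthest instance of the class itself."""
    c = centroid(own)
    farthest_own = max(dist(c, x) for x in own)
    nearest_other = min(dist(c, x) for x in others)
    return nearest_other < farthest_own


# Made-up 2-D features (hypothetical): one cat instance sits among the tigers.
cat = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]
tiger = [[3.0, 0.5], [6.0, 0.0], [7.0, 0.0]]

print(centroid(cat))                      # [2.0, 0.0]
print(centroid_is_ambiguous(cat, tiger))  # True
```

When this check returns True for real features, embedding a semantic vector to the plain centroid inherits the ambiguity, which is the situation a learned prototype is meant to avoid.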