Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning

Ziming Liu, Jingcai Guo, Song Guo, Xiaocheng Lu

TL;DR

Epsilon tackles multi-label zero-shot learning (MLZSL) by addressing the incomplete alignment between local object-level cues and global scene semantics. It introduces two modules: Group Prompts Aggregation (GPA), which refines local features through semantic grouping, and Global Forward Propagation (GFP), which enriches global semantic diversity. Their outputs are combined through a learnable fusion and trained with a regularized ranking-based loss. On the large-scale benchmarks NUS-Wide and Open-Images-v4, the approach outperforms state-of-the-art methods in both the ZSL and GZSL settings, generalizing to unseen labels while maintaining seen-label accuracy. The result is a coherent, end-to-end framework that preserves semantic integrity across spatial and global dimensions, enabling more reliable visual-semantic transfer in real-world multi-label scenarios.

Abstract

This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL), in which a model is trained to recognize multiple unseen classes within a sample (e.g., an image) based on seen classes and auxiliary knowledge such as semantic information. Existing methods usually analyze the relationships among the seen classes in a sample along spatial or semantic dimensions and transfer the learned model to unseen classes. However, they neglect the integrity of local and global features. Although an attention structure can accurately locate local features, especially objects, it significantly compromises their integrity and also affects the relationships between classes. Coarse processing of global features likewise reduces comprehensiveness. As a result, the model loses its grasp of the main components of the image, and relying only on the local presence of seen classes at inference introduces unavoidable bias. In this paper, we propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully exploit these properties and enable a more accurate and robust visual-semantic projection. For spatial information, we achieve effective refinement by aggregating image features in groups into several semantic prompts. This aggregates semantic information rather than class information, preserving the correlation between semantics. For global semantics, we use global forward propagation to collect as much information as possible so that no semantics are omitted. Experiments on the large-scale MLZSL benchmark datasets NUS-Wide and Open-Images-v4 demonstrate that Epsilon outperforms other state-of-the-art methods by large margins.
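
To make the idea of a visual-semantic projection concrete, here is a minimal NumPy sketch; it is not the paper's implementation. It assumes an image has already been projected to an embedding `z` in the same space as the class word vectors. The function names `class_scores` and `pairwise_ranking_loss`, the dot-product scoring, and the softplus form of the ranking loss are all illustrative assumptions; the paper's actual loss and regularizer may differ.

```python
import numpy as np

def class_scores(z, class_vecs):
    """Dot-product score of one image embedding z (D,) against class vectors (K, D)."""
    return class_vecs @ z

def pairwise_ranking_loss(scores, labels):
    """Generic pairwise ranking loss (an assumed form, not the paper's exact loss).

    Penalizes every (positive, negative) label pair in which the negative
    label scores close to or above the positive one.
    """
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        return 0.0
    diff = neg[None, :] - pos[:, None]           # (P, N) score gaps
    # logaddexp(0, x) = log(1 + exp(x)), computed in a numerically stable way
    return float(np.mean(np.logaddexp(0.0, diff)))

rng = np.random.default_rng(0)
dim = 8                                          # hypothetical semantic dimension
seen_vecs = rng.normal(size=(5, dim))            # word vectors of seen classes
unseen_vecs = rng.normal(size=(3, dim))          # word vectors of unseen classes
z = rng.normal(size=dim)                         # projected image embedding

# Training signal uses seen classes only.
seen_labels = np.array([1, 0, 0, 1, 0])
loss = pairwise_ranking_loss(class_scores(z, seen_vecs), seen_labels)

# Zero-shot inference: score the same embedding against unseen class vectors.
unseen_scores = class_scores(z, unseen_vecs)
top2_unseen = np.argsort(-unseen_scores)[:2]
```

The point of the sketch is that nothing in the scoring step depends on which classes were seen during training: unseen classes are recognized by swapping in their word vectors, which is why the quality of the projected embedding matters so much.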

Paper Structure

This paper contains 19 sections, 13 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Comparison of attention between the proposed Epsilon and traditional spatial attention-based models. Epsilon captures class semantics more completely than the compared BiAM and LESA. (zoom in for a better view)
  • Figure 2: Pipeline of Epsilon. The feature representation of the image is first obtained through a frozen pre-trained backbone network. The image features are then fed to the Group Prompts Aggregation Module (GPA Module), which generates local semantics, and the Global Forward Propagation Module (GFP Module), which generates diverse global semantics. Finally, the outputs of the two modules are integrated to obtain the complete semantics. (zoom in for a better view)
  • Figure 3: Hyper-parameter selection. All experiments are performed on the NUS-Wide test set.
  • Figure 4: Top-10 labels predicted by Epsilon in the Generalized MLZSL setting on the NUS-Wide dataset. Asterisks mark unseen labels, and bold marks successfully predicted seen and unseen labels.
  • Figure 5: Qualitative results. The top-10 labels predicted by Epsilon in the Generalized MLZSL setting on the NUS-Wide dataset are shown above. Asterisks mark unseen labels, and bold marks successfully predicted seen and unseen labels.
  • ...and 1 more figure
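
The two-branch pipeline described in the Figure 2 caption can be sketched roughly as follows. This is a hypothetical NumPy illustration, not the authors' architecture: the attention-style grouping, the two-layer feed-forward global branch, the scalar fusion weight `alpha`, and every name and shape below are assumptions made for the example. Random arrays stand in for frozen-backbone features and learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def group_prompt_aggregation(feats, prompts, w_local):
    """Local branch (assumed form): each prompt attends over region features.

    feats: (R, C) region features, prompts: (G, C), w_local: (C, D).
    Returns (G, D) group-level semantic vectors.
    """
    attn = softmax(prompts @ feats.T / np.sqrt(feats.shape[1]), axis=1)  # (G, R)
    grouped = attn @ feats                       # (G, C) aggregated groups
    return grouped @ w_local

def global_forward(feats, w1, w2):
    """Global branch (assumed form): pool all regions, then a small MLP."""
    pooled = feats.mean(axis=0)                  # (C,)
    hidden = np.maximum(pooled @ w1, 0.0)        # ReLU
    return hidden @ w2                           # (D,)

def fuse_and_score(local_sem, global_sem, class_vecs, alpha):
    """Blend local and global class scores with a scalar weight alpha."""
    local_scores = (class_vecs @ local_sem.T).max(axis=1)   # best group per class
    global_scores = class_vecs @ global_sem
    return alpha * local_scores + (1.0 - alpha) * global_scores

rng = np.random.default_rng(0)
R, C, G, H, D, K = 49, 32, 4, 24, 16, 6          # hypothetical sizes
feats = rng.normal(size=(R, C))                  # stand-in for frozen backbone output
prompts = rng.normal(size=(G, C))
w_local = rng.normal(size=(C, D))
w1, w2 = rng.normal(size=(C, H)), rng.normal(size=(H, D))
class_vecs = rng.normal(size=(K, D))             # class word vectors

local_sem = group_prompt_aggregation(feats, prompts, w_local)
global_sem = global_forward(feats, w1, w2)
scores = fuse_and_score(local_sem, global_sem, class_vecs, alpha=0.5)
```

In a trained model `alpha` and the weight matrices would be learned while the backbone stays frozen. The sketch only shows how local group-level semantics and a pooled global vector can both be scored against the same class vectors before fusion.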