On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic Survey

Jingcai Guo; Zhijie Rao; Zhi Chen; Song Guo; Jingren Zhou; Dacheng Tao

TL;DR

This paper surveys recent advances in element-wise ZSIR to provide a sound basis for its future development. It integrates three basic ZSIR tasks into a unified element-wise paradigm and provides a detailed taxonomy and analysis of the main approaches.

Abstract

Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen domains by learning generalized knowledge from limited data in the seen domain. The gist of ZSIR is constructing a well-aligned mapping between the input visual space and the target semantic space, which is a bottom-up paradigm inspired by the process by which humans observe the world. In recent years, ZSIR has witnessed significant progress across a broad spectrum, from theory to algorithm design, as well as widespread applications. However, to the best of our knowledge, a systematic review of ZSIR from an element-wise perspective, i.e., learning fine-grained elements of data and their inferential associations, is still lacking. To fill the gap, this paper thoroughly investigates recent advances in element-wise ZSIR and provides a sound basis for its future development. Concretely, we first integrate three basic ZSIR tasks, i.e., object recognition, compositional recognition, and foundation model-based open-world recognition, into a unified element-wise paradigm and provide a detailed taxonomy and analysis of the main approaches. Next, we summarize the benchmarks, covering technical implementations, standardized datasets, and further details, organized as a library. Last, we sketch out related applications, discuss vital challenges, and suggest potential future directions.

Paper Structure (60 sections, 4 equations, 6 figures, 4 tables)

Introduction
Overview
Mainstream Tasks
Zero-Shot Object Recognition
Zero-Shot Compositional Recognition
Foundation Model-Based Open-World Recognition
Comparison
Challenge
Fine-Grained Visual Analysis
Domain Shift
Organization
Zero-Shot Object Recognition
Problem Formulation
Visual Component Analysis
Local Visual Attention
...and 45 more sections

Figures (6)

Figure 1: Three tasks of ZSIR: (a) ZSOR utilizes shared attributes/texts to identify unseen categories; (b) ZSCR infers unseen compositions by learning from seen objects and states; and (c) FBOR exploits the broad fundamentals learned by pre-trained VLMs to implement zero-shot recognition directly in downstream tasks.
Figure 2: An overview of the organization and taxonomy of this survey.
Figure 3: A schematic diagram of local visual attention. Visual features are passed through a sub-network to generate multiple masks, which are then multiplied with the features to obtain enhanced features.
Figure 4: A schematic diagram of cross attention. Attribute embeddings and region features compute similarity to obtain multiple attention maps.
Figure 5: A schematic diagram of dependency modeling (kim2023hierarchical). Object information is used as a signal to guide the extraction process of state features.
...and 1 more figure
