On the Element-Wise Representation and Reasoning in Zero-Shot Image Recognition: A Systematic Survey

Jingcai Guo, Zhijie Rao, Zhi Chen, Song Guo, Jingren Zhou, Dacheng Tao

TL;DR

This paper surveys recent advances in element-wise ZSIR. It integrates three basic ZSIR tasks into a unified element-wise paradigm and provides a detailed taxonomy and analysis of the main approaches, offering a sound basis for future development.

Abstract

Zero-shot image recognition (ZSIR) aims to recognize and reason in unseen domains by learning generalized knowledge from limited data in the seen domain. The gist of ZSIR is constructing a well-aligned mapping between the input visual space and the target semantic space, which is a bottom-up paradigm inspired by the process by which humans observe the world. In recent years, ZSIR has witnessed significant progress on a broad spectrum, from theory to algorithm design, as well as widespread applications. However, to the best of our knowledge, there remains a lack of a systematic review of ZSIR from an element-wise perspective, i.e., learning fine-grained elements of data and their inferential associations. To fill the gap, this paper thoroughly investigates recent advances in element-wise ZSIR and provides a sound basis for its future development. Concretely, we first integrate three basic ZSIR tasks, i.e., object recognition, compositional recognition, and foundation model-based open-world recognition, into a unified element-wise paradigm and provide a detailed taxonomy and analysis of the main approaches. Next, we summarize the benchmarks, covering technical implementations, standardized datasets, and some more details as a library. Last, we sketch out related applications, discuss vital challenges, and suggest potential future directions.

Paper Structure

This paper contains 60 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Three tasks of ZSIR: (a) ZSOR utilizes shared attributes/texts to identify unseen categories; (b) ZSCR infers unseen compositions by learning from seen objects and states; and (c) FBOR exploits the broad fundamentals learned by pre-trained VLMs to implement zero-shot recognition directly in downstream tasks.
  • Figure 2: An overview of the organization and taxonomy of this survey.
  • Figure 3: A schematic diagram of local visual attention. Visual features are passed through a sub-network to generate multiple masks, which are then multiplied with the features to obtain enhanced features.
  • Figure 4: A schematic diagram of cross attention. Similarities between attribute embeddings and region features are computed to obtain multiple attention maps.
  • Figure 5: A schematic diagram of dependency modeling [kim2023hierarchical]. Object information is used as a signal to guide the extraction process of state features.
  • ...and 1 more figure
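
The two attention schemes named in the captions of Figures 3 and 4 can be illustrated with a minimal NumPy sketch. This is not the implementation of any surveyed method. The function names, the single linear layer standing in for the mask sub-network, the sigmoid gating, and the scaled dot-product similarity are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_visual_attention(features, mask_weights):
    """Figure 3 style: generate K masks, then multiply them with the features.

    features:     (R, D) region features, e.g. a flattened 7x7 grid.
    mask_weights: (D, K) hypothetical linear layer standing in for the sub-network.
    Returns masks of shape (R, K) and enhanced features of shape (K, R, D).
    """
    masks = 1.0 / (1.0 + np.exp(-(features @ mask_weights)))  # sigmoid gate in [0, 1]
    # Broadcast each mask over the feature dimension: one reweighted copy per mask.
    enhanced = masks.T[:, :, None] * features[None, :, :]
    return masks, enhanced

def cross_attention(attr_emb, features):
    """Figure 4 style: attribute-to-region similarity yields one attention map per attribute.

    attr_emb: (A, D) attribute embeddings.
    features: (R, D) region features.
    Returns attention maps of shape (A, R) and attended features of shape (A, D).
    """
    scores = attr_emb @ features.T / np.sqrt(features.shape[1])  # scaled dot product (assumed)
    attn = softmax(scores, axis=-1)   # each attribute's map sums to 1 over regions
    attended = attn @ features        # attribute-specific pooled visual feature
    return attn, attended

# Toy inputs: 49 regions, 16-dim features, 4 masks, 5 attributes (all values random).
rng = np.random.default_rng(0)
features = rng.normal(size=(49, 16))
mask_weights = rng.normal(size=(16, 4))
attr_emb = rng.normal(size=(5, 16))

masks, enhanced = local_visual_attention(features, mask_weights)
attn, attended = cross_attention(attr_emb, features)
```

In this sketch each attribute receives its own attention map over regions, so the pooled feature for one attribute can attend to a different region than the pooled feature for another. That matches the element-wise perspective the survey describes, where recognition relies on fine-grained elements of the data.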