A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future

Chaoyang Zhu; Long Chen

TL;DR

This survey provides a comprehensive review of recent developments in OVD and OVS. It finds that whether weak supervision signals are permitted, and how they are used, clearly distinguishes the main methodologies: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning.

Abstract

As the most fundamental scene understanding tasks, object detection and segmentation have made tremendous progress in the deep learning era. Because manual labeling is expensive, the annotated categories in existing datasets are often small-scale and pre-defined, so state-of-the-art fully-supervised detectors and segmentors fail to generalize beyond the closed vocabulary. To resolve this limitation, in the last few years the community has paid increasing attention to Open-Vocabulary Detection (OVD) and Segmentation (OVS). By "open-vocabulary", we mean that the models can classify objects beyond pre-defined categories. In this survey, we provide a comprehensive review of recent developments in OVD and OVS. A taxonomy is first developed to organize different tasks and methodologies. We find that whether weak supervision signals are permitted, and how they are used, clearly distinguishes the different methodologies: visual-semantic space mapping, novel visual feature synthesis, region-aware training, pseudo-labeling, knowledge distillation, and transfer learning. The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, and 3D and video understanding. The main design principles, key challenges, development routes, and the strengths and weaknesses of each methodology are thoroughly analyzed. In addition, we benchmark each task, along with the vital components of each method, in the appendix; the benchmarks are kept up to date online at https://github.com/seanzhuh/awesome-open-vocabulary-detection-and-segmentation. Finally, several promising directions are provided and discussed to stimulate future research.
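To make the notion of "open-vocabulary" concrete, the following is a minimal sketch, not a method from the survey. It assumes that region features and class-name embeddings already live in a shared visual-semantic space, so classification reduces to a nearest-neighbor lookup by cosine similarity and new categories can be added at inference time without retraining. The function names, the 3-dimensional vectors, and the category names are all hypothetical and chosen only for illustration; real systems would obtain embeddings from a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_region(region_feat, class_embs):
    """Return the class whose embedding is most similar to the region feature."""
    scores = {name: cosine(region_feat, emb) for name, emb in class_embs.items()}
    return max(scores, key=scores.get)


# Hypothetical toy embeddings (assumption: a shared visual-semantic space).
base_vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
region = [0.1, 0.2, 0.9]  # feature of a region showing an unseen category

# Closed vocabulary: the region is forced into one of the base classes.
closed_label = classify_region(region, base_vocab)

# Open vocabulary: a novel class name is added at inference time only.
open_vocab = dict(base_vocab, zebra=[0.0, 0.1, 1.0])
open_label = classify_region(region, open_vocab)

print(closed_label, open_label)
```

With the closed vocabulary the region can only be assigned to a base class, whereas extending the set of class embeddings lets the same classifier select the novel category.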

Paper Structure (39 sections, 5 equations, 5 figures, 17 tables)

Introduction
Preliminaries
Problem Definition
Related Domains and Tasks
Canonical Closed-Set Detectors and Segmentors
Large Vision-Language Models (VLMs)
Zero-Shot Detection (ZSD)
Visual-Semantic Space Mapping
Learning a Mapping from Visual to Semantic Space
Learning a Joint Mapping of Visual-Semantic Space
Learning a Mapping from Semantic to Visual Space
Novel Visual Feature Synthesis
Zero-Shot Segmentation (ZSS)
Zero-Shot Semantic Segmentation (ZSSS)
Visual-Semantic Space Mapping
...and 24 more sections

Figures (5)

Figure 1: The proposed taxonomy. Typical models are shown in each category. VLMs-IE denote the image encoder of VLMs.
Figure 2: A general comparison of each methodology. "Vis. Feats" and "Sem. Embs" are visual features and semantic embeddings (e.g., word2vec, GloVe, fastText, BERT), respectively.
Figure 3: Flowchart of novel visual feature synthesis.
Figure 4: A basic pipeline of knowledge distillation methodology. Distillation loss is typically a $\mathcal{L}_1$ loss. We omit the localization branch for brevity.
Figure 5: Framework for transfer learning-based models.
