Towards Open Vocabulary Learning: A Survey

Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Xia Li, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, Dacheng Tao

TL;DR

The paper surveys open vocabulary learning for visual scene understanding, addressing the gap between closed-set models and the ability to recognize novel categories through language supervision and vision-language pre-training. It covers concept definitions, history, datasets, and metrics, then systematically reviews methods across detection, segmentation, video understanding, and 3D understanding, organized around five core design principles and built on established foundations such as CLIP and DETR-based architectures. The work highlights practical challenges, including data costs, overfitting to base classes, and cross-dataset evaluation, and outlines future directions such as temporal integration, diffusion-model-assisted segmentation, cross-modal adapters, and alignment with large language models. Overall, the survey maps the rapid evolution of open vocabulary approaches, offering a consolidated framework for researchers and practitioners aiming to deploy flexible, scalable vision systems. $C_B$, $C_N$, and $C_L$ formalize the label and vocabulary spaces central to these methods, while metrics such as mAP, mIoU, and PQ are used to evaluate generalization to novel classes.

Abstract

In the field of visual scene understanding, deep neural networks have made impressive advancements in various core tasks like segmentation, tracking, and detection. However, most approaches operate on the closed-set assumption, meaning that the model can only identify pre-defined categories that are present in the training set. Recently, open vocabulary settings were proposed due to the rapid progress of vision-language pre-training. These new approaches seek to locate and recognize categories beyond the annotated label space. The open vocabulary approach is more general, practical, and effective compared to weakly supervised and zero-shot settings. This paper provides a thorough review of open vocabulary learning, summarizing and analyzing recent developments in the field. In particular, we begin by comparing it to related concepts such as zero-shot learning, open-set recognition, and out-of-distribution detection. Then, we review several closely related tasks in the case of segmentation and detection, including long-tail problems, few-shot, and zero-shot settings. For the method survey, we first present the basics of closed-set detection and segmentation as preliminary knowledge. Next, we examine various scenarios in which open vocabulary learning is used, identifying common design elements and core ideas. Then, we compare recent detection and segmentation approaches on commonly used datasets and benchmarks. Finally, we conclude with insights, issues, and discussions regarding future research directions. To our knowledge, this is the first comprehensive literature review of open vocabulary learning. We keep tracking related work at https://github.com/jianzongwu/Awesome-Open-Vocabulary.

Paper Structure

This paper contains 27 sections, 1 equation, 5 figures, and 16 tables.

Figures (5)

  • Figure 1: Comparison of concepts: open-set/open-world/out-of-distribution (OOD) detection, zero-shot, and open vocabulary. Different shapes represent different novel categories, and colors represent the predictions for the novel objects. (a) In the open-set/open-world/OOD settings, the model only needs to identify novel classes and mark them as "unknown". (b) In the zero-shot setting, the model must classify novel classes into specific categories. (c) In the open vocabulary setting, the model can classify novel classes with the help of large language vocabulary knowledge $C_L$.
  • Figure 2: Timeline of open vocabulary learning. Gray boxes indicate representative works. Green boxes indicate foundation models and VLMs. In open vocabulary learning, many works exploit the knowledge learned by pre-trained vision foundation models like Swin and VLMs like CLIP. Recently, some works also explore the use of diffusion models in this setting.
  • Figure 3: Summary of open vocabulary learning works. (a) The number of research works continues to increase each year. (b) Detection and segmentation have more papers than 3D and video. (c) The number of works in each direction per year. The statistics were collected on 2024/1/15.
  • Figure 4: Open vocabulary learning methods, organized by their tasks and approach types. We list several representative works here.
  • Figure 5: An illustration of a common architecture in open vocabulary object detection and segmentation. The vision model predicts a class embedding for each box/mask. The embeddings are compared to a set of class embeddings generated by a VLM text model like CLIP or ALIGN, using dot products. The class with the highest score is chosen as the predicted label for the object. Note that while humans define the set of possible object classes, the system only has access to a limited set of "base" classes during training.
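
The classification step described in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of any surveyed method: the function names (`l2_normalize`, `classify_regions`), the 3-dimensional toy embeddings, and the class names are all hypothetical. In a real system the region embeddings would come from the vision model and the class embeddings from a VLM text encoder such as CLIP or ALIGN. The L2 normalization before the dot product (which makes the score a cosine similarity) is also an assumption made here for numerical convenience.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumption: scores are cosine similarities)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def classify_regions(
    region_embeddings: Sequence[Sequence[float]],
    text_embeddings: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """For each box/mask embedding, pick the class whose text embedding scores highest."""
    names = list(text_embeddings)
    text = [l2_normalize(text_embeddings[name]) for name in names]
    results = []
    for emb in region_embeddings:
        e = l2_normalize(emb)
        scores = [dot(e, t) for t in text]
        best = max(range(len(names)), key=scores.__getitem__)
        results.append((names[best], scores[best]))
    return results


# Hypothetical toy vocabulary. "zebra" plays the role of a novel class:
# it has a text embedding at inference time but no box/mask labels in training.
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],    # base class
    "dog": [0.0, 1.0, 0.0],    # base class
    "zebra": [0.0, 0.0, 1.0],  # novel class
}

# Hypothetical per-region embeddings predicted by the vision model.
region_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.2, 0.95],
]

predictions = classify_regions(region_embeddings, text_embeddings)
labels = [name for name, _ in predictions]
```

Because the classifier is just a comparison against text embeddings, extending the vocabulary at inference time only requires adding an entry to `text_embeddings`; no weights of the vision model change in this sketch.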