From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham

TL;DR

This work tackles open world object detection by unifying open vocabulary and open world learning. It introduces Open World Embedding Learning (OWEL), which discovers and incrementally learns unseen objects via a pseudo unknown embedding, and Multi-Scale Contrastive Anchor Learning (MSCAL), which tightens known class embeddings and identifies near-out-of-distribution (NOOD) objects across scales. Together, OWEL and MSCAL achieve state-of-the-art performance on open world object detection (OWOD) benchmarks and show strong results in autonomous driving scenarios while preserving zero-shot open vocabulary capabilities. The approach avoids exemplar replay and maintains efficiency by freezing prior embeddings, offering a practical path toward robust open world perception in real-world systems.

Abstract

Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed number of objects predefined in the training set. Recent work on open vocabulary object detection (OVD) enables the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD relies heavily on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar features to known classes, and to ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of a Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: YOLO-World fails to correctly detect objects that are not included in the prompt. When we use a prompt set comprising all PASCAL VOC classes, the model misclassifies near-out-of-distribution objects (teddy bears) and ignores far-out-of-distribution objects (plates).
  • Figure 2: Overview of the proposed method. During model training, we first initialize known class embeddings with a pretrained CLIP text encoder [clip]. The image encoder extracts a multi-scale feature map from the input. Then the RepVL-PAN [yolow] uses multi-level cross-modal fusion to combine image and text features, forming the feature pyramids. The detection head predicts the class label based on image-text similarity and regresses the bounding box. The detection loss is used to update the known class embeddings. Concurrently, MSCAL modules are trained to maximize the similarity between the class anchor and spatial locations at different scales, and output a multi-scale score map that indicates whether an embedding is out-of-distribution (OOD) relative to a specified class. During inference, the OOD map extracted by MSCAL is used to reduce known-unknown confusion. In addition, the pseudo unknown embedding used to discover unknown classes is constructed from the optimized known class embeddings and the generic "objectness" semantic concept.
  • Figure 3: Inferring the Pseudo Unknown Embedding in the embedding space. For CLIP-like models, text embeddings are mapped onto a unit hypersphere, and the distance between embeddings reflects their semantic similarity. In a continuous language space, there should be an embedding that represents generic objectness. Since we know the embeddings of the known classes, we can use generic objectness as a pivot to estimate the Pseudo Unknown Embedding.
  • Figure 4: Illustration of the MSCAL module. For each layer in the feature pyramid, all spatial locations are mapped to a new space and contrasted with class anchors. The design of the projector follows [Wang_2021_ICCV] and consists of two $1 \times 1$ convolutional layers with ReLU and batch normalization. The anchor is also parameterized as a $1 \times 1$ convolutional layer. During inference, the inner product between each projected spatial location and the class anchor serves as the OOD score.
  • Figure 5: Qualitative results on M-OWODB and nu-OWODB. Our method produces higher-quality bounding boxes for known and unknown objects than PROB [prob] and EO-OWODB [sun2024exploring].
  • ...and 2 more figures
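
The scoring step described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that a $1 \times 1$ convolution acts as a per-location linear map, leaves out batch normalization, and L2-normalizes both the projected embeddings and the anchor; the normalization is an assumption of this sketch, not something the caption states. All function and parameter names (`conv1x1`, `ood_score_map`, `multi_scale_score_maps`, `w1`, `b1`, `w2`, `b2`) are hypothetical.

```python
import math


def conv1x1(vec, weight, bias):
    """Apply a 1x1 convolution at one spatial location (a plain linear map)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def relu(vec):
    return [max(0.0, x) for x in vec]


def l2_normalize(vec, eps=1e-12):
    # Assumption of this sketch: embeddings are compared on the unit sphere.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def ood_score_map(feature_map, w1, b1, w2, b2, anchor):
    """Score every location of one pyramid level against one class anchor.

    feature_map is a nested list of shape H x W x C. The projector is two
    1x1 convolutions with a ReLU in between (batch norm omitted here).
    Returns an H x W nested list of inner products with the anchor.
    """
    anchor = l2_normalize(anchor)
    scores = []
    for row in feature_map:
        score_row = []
        for feat in row:
            hidden = relu(conv1x1(feat, w1, b1))
            emb = l2_normalize(conv1x1(hidden, w2, b2))
            score_row.append(sum(a * e for a, e in zip(anchor, emb)))
        scores.append(score_row)
    return scores


def multi_scale_score_maps(pyramid, projectors, anchors):
    """Apply ood_score_map to each pyramid level with its own parameters."""
    return [ood_score_map(fmap, *proj, anchor)
            for fmap, proj, anchor in zip(pyramid, projectors, anchors)]


# Toy example: one pyramid level, a 1 x 2 feature map with 2 channels,
# identity projector weights, and an anchor aligned with the first channel.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
level = [[[1.0, 0.0], [0.0, 1.0]]]
maps = multi_scale_score_maps(
    [level], [(identity, zeros, identity, zeros)], [[1.0, 0.0]]
)
```

In this toy case the first location aligns with the anchor and scores close to 1, while the second is orthogonal to it and scores close to 0. How such scores are thresholded or combined across scales is not specified in the captions above, so the sketch stops at the raw score maps.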