From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects
Zizhao Li, Zhengkang Xiang, Joseph West, Kourosh Khoshelham
TL;DR
This work tackles open world object detection (OWOD) by unifying open vocabulary and open world learning. It introduces Open World Embedding Learning (OWEL), which discovers and incrementally learns unseen objects via a pseudo unknown embedding, and Multi-Scale Contrastive Anchor Learning (MSCAL), which tightens known-class embeddings and identifies near-out-of-distribution (NOOD) objects across scales. Together, OWEL and MSCAL achieve state-of-the-art performance on OWOD benchmarks and show strong results in autonomous driving scenarios while preserving zero-shot open vocabulary capabilities. The approach avoids exemplar replay and stays efficient by freezing prior embeddings, offering a practical path toward robust open world perception in real-world systems.
Abstract
Traditional object detection methods operate under the closed-set assumption, where models can only detect a fixed set of object classes predefined in the training set. Recent works on open vocabulary object detection (OVD) enable the detection of objects defined by an in-principle unbounded vocabulary, which reduces the cost of training models for specific tasks. However, OVD relies heavily on accurate prompts provided by an "oracle", which limits its use in critical applications such as driving scene perception. OVD models tend to misclassify near-out-of-distribution (NOOD) objects that have similar features to known classes, and to ignore far-out-of-distribution (FOOD) objects. To address these limitations, we propose a framework that enables OVD models to operate in open world settings by identifying and incrementally learning previously unseen objects. To detect FOOD objects, we propose Open World Embedding Learning (OWEL) and introduce the concept of a Pseudo Unknown Embedding, which infers the location of unknown classes in a continuous semantic space based on the information of known classes. We also propose Multi-Scale Contrastive Anchor Learning (MSCAL), which enables the identification of misclassified unknown objects by promoting the intra-class consistency of object embeddings at different scales. The proposed method achieves state-of-the-art performance on standard open world object detection and autonomous driving benchmarks while maintaining its open vocabulary object detection capability.
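To make the idea of flagging misclassified NOOD objects concrete, the following is a minimal sketch and not the authors' implementation. It assumes each known class has a single anchor vector and that an object whose embedding lies far from the anchor of its predicted class is relabeled as unknown. The function name `classify_with_anchor_check`, the single-scale setup, the cosine-similarity test, and the threshold value are all hypothetical choices made for illustration. The abstract states only that MSCAL promotes intra-class consistency of object embeddings at different scales.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_with_anchor_check(obj_emb, class_emb, anchors, threshold=0.9):
    """Hypothetical single-scale consistency check (an assumption, not MSCAL itself).

    obj_emb:   (N, D) object embeddings from the detector
    class_emb: (C, D) embeddings of the known class prompts
    anchors:   (C, D) one anchor per known class
    Returns an (N,) array of class indices, with -1 meaning "unknown".
    """
    obj = l2_normalize(obj_emb)
    cls = l2_normalize(class_emb)
    anc = l2_normalize(anchors)

    # Open vocabulary step: pick the most similar known class for each object.
    pred = (obj @ cls.T).argmax(axis=1)

    # Consistency step: compare each object with the anchor of its predicted class.
    anchor_sim = np.sum(obj * anc[pred], axis=1)

    # Objects that sit too far from their class anchor are treated as unknown.
    return np.where(anchor_sim >= threshold, pred, -1)


# Toy data: two known classes in a 2-D embedding space.
class_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
anchors = class_emb.copy()  # assumption: anchors coincide with class embeddings
objects = np.array([[0.9, 0.1], [0.1, 1.0], [0.7, 0.6]])
labels = classify_with_anchor_check(objects, class_emb, anchors, threshold=0.9)
```

In this toy example the third object is closest to class 0 but lies well between the two classes, so the anchor check relabels it as unknown. A closed-vocabulary argmax alone would have assigned it to class 0, which is the NOOD failure mode the abstract describes.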
