OW-Rep: Open World Object Detection with Instance Representation Learning

Sunoh Lee, Minsik Jeon, Jihong Min, Junwon Seo

TL;DR

This work extends Open World Object Detection (OWOD) so that the detector learns semantically rich instance representations in addition to detecting unknown objects. It introduces two training-time modules that leverage Vision Foundation Models: an Unknown Box Refine Module, which uses SAM to localize unknown objects accurately, and an Embedding Transfer Module, which distills DINOv2-based semantic relationships into the detector's embeddings via a relaxed contrastive loss. The approach yields improved unknown object detection, richer instance embeddings, and better open-world tracking performance across the OWOD and Unknown-Unknown splits, while remaining computationally efficient. These results demonstrate the value of integrating foundation-model-derived supervision into open-world perception for robust scene understanding.

Abstract

Open World Object Detection (OWOD) addresses realistic scenarios where unseen object classes emerge, enabling detectors trained on known classes to detect unknown objects and incrementally incorporate the knowledge they provide. While existing OWOD methods primarily focus on detecting unknown objects, they often overlook the rich semantic relationships between detected objects, which are essential for scene understanding and applications in open-world environments (e.g., open-world tracking and novel class discovery). In this paper, we extend the OWOD framework to jointly detect unknown objects and learn semantically rich instance embeddings, enabling the detector to capture fine-grained semantic relationships between instances. To this end, we propose two modules that leverage the rich and generalizable knowledge of Vision Foundation Models (VFMs) and can be integrated into open-world object detectors. First, the Unknown Box Refine Module uses instance masks from the Segment Anything Model (SAM) to accurately localize unknown objects. Second, the Embedding Transfer Module distills instance-wise semantic similarities from VFM features into the detector's embeddings via a relaxed contrastive loss, enabling the detector to learn semantically meaningful and generalizable instance features. Extensive experiments show that our method significantly improves both unknown object detection and instance embedding quality, while also enhancing performance in downstream tasks such as open-world tracking.
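As a rough illustration of the box refinement idea, the sketch below turns a binary instance mask into the tightest enclosing box and uses it in place of a coarse unknown box. It assumes the mask has already been produced by SAM for that proposal (the SAM call itself is not shown), and that the refined box is simply the mask's bounding box; the function names `mask_to_box` and `refine_unknown_box` are hypothetical and not taken from the paper.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (an assumed convention).
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around all nonzero pixels, or None for an empty mask."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


def refine_unknown_box(coarse_box: Box, sam_mask: List[List[int]]) -> Box:
    """Use the mask's bounding box as pseudo ground truth; keep the coarse box if the mask is empty."""
    refined = mask_to_box(sam_mask)
    return refined if refined is not None else coarse_box


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
    ]
    print(refine_unknown_box((0, 0, 3, 3), mask))
```

In a training loop, the refined box would serve as the regression target for the corresponding unknown proposal; how proposals are matched to masks is left out of this sketch.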

Paper Structure

This paper contains 33 sections, 4 equations, 12 figures, and 15 tables.

Figures (12)

  • Figure 1: We propose a method for training an open-world object detector that not only detects unknown objects but also learns semantically rich feature embeddings that capture meaningful inter-object relationships. Existing OWOD methods [joseph2021towards, gupta2022ow, zohar2023prob, doan_2024_HypOW] primarily focus on detecting unknown objects but overlook the semantic relationships between proposals. Our approach explicitly captures these relationships by enhancing instance embeddings through VFM [oquab2023dinov2, kirillov2023segment] distillation, while also improving unknown object detection.
  • Figure 2: Overall Architecture of the Proposed Method. Our method extends OWOD by not only detecting unknown objects but also extracting semantically rich features. We adopt PROB [zohar2023prob] as the baseline detector, but any other open-world object detector can be used as the baseline. During training, the known and unknown proposals from PROB, with their corresponding instance embeddings, are fed into the proposed modules. The Unknown Box Refine Module improves the localization of unknown objects by treating refined unknown boxes from SAM [kirillov2023segment] as pseudo ground truth. The Embedding Transfer Module extracts source embeddings by average pooling DINOv2 [oquab2023dinov2] features within the refined unknown and known proposals. Pairwise similarities between source embeddings are then computed and used as weights for the relaxed contrastive loss [kim2021embedding], controlling the attraction and repulsion between instance embeddings. At inference, the detector generates semantically rich instance embeddings, capturing fine-grained relationships between detected proposals.
  • Figure 3: Qualitative results for inter-proposal relationships on the OWOD and Unknown-Unknown splits. Proposals with high feature similarity to the reference are shown in red, while dissimilar proposals are shown in blue. Our method successfully captures semantic similarities between both known and unknown objects. For example, the reference giraffe is similar to both an unknown giraffe and a known horse, while the fire hydrant is highly dissimilar. In contrast, PROB treats all proposals as highly similar. RNCDL, despite using self-supervision to learn features, fails to capture meaningful semantics, mistakenly considering the giraffe and fire hydrant highly similar.
  • Figure 4: Qualitative results of unknown object detection. Unknown object detections from PROB (top row) and our model (bottom row) are compared. By leveraging instance masks from SAM, our model achieves accurate localization.
  • Figure 5: t-SNE visualization of the learned instance embeddings. Our method learns a rich instance embedding space, capturing semantic relationships between objects.
  • ...and 7 more figures
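
The Embedding Transfer Module described in the Figure 2 caption can be sketched roughly as follows: source embeddings are obtained by average pooling a feature map inside each box, and pairwise source similarities weight the attraction and repulsion terms of a contrastive loss on the detector's embeddings. This is a simplified sketch, not the paper's implementation. It assumes a Gaussian kernel on source distances for the weights and fixed hyperparameters `sigma` and `delta`; the paper's exact kernel, distance normalization, and settings may differ, and all function names here are hypothetical. Plain Python lists stand in for tensors to keep the example dependency-free.

```python
import math
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]      # indexed as feature_map[y][x] -> feature vector
Box = Tuple[int, int, int, int]      # (x_min, y_min, x_max, y_max), inclusive (assumed)


def average_pool(feature_map: FeatureMap, box: Box) -> Vector:
    """Mean of the feature vectors at every location inside the box."""
    x0, y0, x1, y1 = box
    cells = [feature_map[y][x] for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]
    dim = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(dim)]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def relaxed_contrastive_loss(source: List[Vector], target: List[Vector],
                             sigma: float = 1.0, delta: float = 1.0) -> float:
    """Similarity-weighted contrastive loss over all ordered pairs (i, j), i != j.

    source: embeddings pooled from the foundation model (teacher side).
    target: the detector's instance embeddings (student side).
    """
    n = len(source)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Assumed weighting: Gaussian kernel on the source distance, in (0, 1].
            w = math.exp(-euclidean(source[i], source[j]) ** 2 / sigma)
            d = euclidean(target[i], target[j])
            attract = w * d ** 2                               # pull similar pairs together
            repel = (1.0 - w) * max(0.0, delta - d) ** 2       # push dissimilar pairs apart
            total += attract + repel
    return total / n


if __name__ == "__main__":
    fmap = [[[1.0], [3.0]], [[5.0], [7.0]]]
    src = [average_pool(fmap, (0, 0, 0, 1)), average_pool(fmap, (1, 0, 1, 1))]
    tgt = [[0.0, 0.0], [0.5, 0.0]]
    print(relaxed_contrastive_loss(src, tgt))
```

Because the weights are continuous rather than binary labels, a pair that the source model considers moderately similar is both mildly attracted and mildly repelled, which is what lets the detector's embedding space reflect graded semantic relationships.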