OSR-ViT: A Simple and Modular Framework for Open-Set Object Detection and Discovery

Matthew Inkawhich; Nathan Inkawhich; Hao Yang; Jingyang Zhang; Randolph Linderman; Yiran Chen

TL;DR

Open-Set Object Detection and Discovery (OSODD) addresses a gap in open-set detection by requiring high recall of unknown objects in addition to accurate known-class detection. The authors propose OSR-ViT, a modular detector that combines a class-agnostic proposal network (e.g., THPN) with a ViT-based foundational classifier (e.g., DINOv2), uses an energy-based out-of-distribution (OOD) score, and is evaluated with the threshold-independent $AOSP$ metric. The framework achieves state-of-the-art performance on natural-imagery, limited-data, and remote-sensing benchmarks, often outperforming fully supervised baselines, especially in low-data regimes. Because the two components are modular, new proposal methods or foundational models can be swapped in as they become available.
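To make the energy-based OOD score mentioned above concrete, here is a minimal sketch using the standard free-energy formulation over classifier logits, $E(x) = -T \log \sum_k \exp(z_k / T)$. The function names, the temperature default, and the threshold are illustrative assumptions; the summary does not state the exact scoring details used in OSR-ViT.

```python
import math

def energy_score(logits, temperature=1.0):
    """Standard free-energy score: E(x) = -T * log(sum_k exp(z_k / T)).

    Lower energy means the classifier is more confident that the input
    belongs to some known class. `temperature` is an assumed default.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp

def flag_unknown(logits, threshold, temperature=1.0):
    """Hypothetical helper: mark a proposal as unknown when its energy
    exceeds `threshold`. The threshold value is application-specific."""
    return energy_score(logits, temperature) > threshold

# A peaked logit vector has lower energy than a flat one.
confident = [10.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0]
print(energy_score(confident) < energy_score(uncertain))
```

A fixed threshold appears here only for illustration. A threshold-independent metric such as $AOSP$ would instead evaluate the detector across the range of possible thresholds.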

Abstract

An object detector's ability to detect and flag \textit{novel} objects during open-world deployments is critical for many real-world applications. Unfortunately, much of the work in open object detection today is disjointed and fails to adequately address applications that prioritize unknown object recall \textit{in addition to} known-class accuracy. To close this gap, we present a new task called Open-Set Object Detection and Discovery (OSODD) and as a solution propose the Open-Set Regions with ViT features (OSR-ViT) detection framework. OSR-ViT combines a class-agnostic proposal network with a powerful ViT-based classifier. Its modular design simplifies optimization and allows users to easily swap proposal solutions and feature extractors to best suit their application. Using our multifaceted evaluation protocol, we show that OSR-ViT obtains performance levels that far exceed state-of-the-art supervised methods. Our method also excels in low-data settings, outperforming supervised baselines using a fraction of the training data.

Paper Structure (15 sections, 3 equations, 8 figures, 7 tables)

Introduction
Limitations of Existing Work
Open Set Object Detection and Discovery
Problem Formulation
Evaluation Protocol
OSR-ViT Modular Detection Framework
Proposal Network
Foundational Classifier
Training
Experiments
Natural Imagery Benchmark
Limited Data Benchmark
Ships Benchmark
OSR-ViT Performance Analysis
Conclusion

Figures (8)

Figure 1: While other settings ignore OOD recall, the proposed OSODD task prioritizes it in addition to established metrics. In this example, a "perfect" model according to the OSOD or UAOD protocol may cause severe safety consequences.
Figure 2: Our threshold-agnostic Average Open Set Precision (AOSP) performance metric provides a holistic view of the ID-OOD performance trade-off.
Figure 3: Our OSR-ViT framework consists of two independently-trained models working in conjunction: (1) a class-agnostic Proposal Network, and (2) a ViT-powered Foundational Classifier. This allows for seamless integration of new or future models.
Figure 4: While supervised baselines struggle in the data-constrained settings of our Limited Data Benchmark, our OSR-ViT model maintains good performance.
Figure 5: 2D t-SNE visualization of penultimate features on the VOC$\rightarrow$COCO task. OSR-ViT models generate the most compact ID-class clusters, aiding in ID vs. OOD separation. Also, OSR-ViT's ability to segregate OOD instances into different (likely class-wise) clusters is extremely difficult to emulate with only task-specific supervision.
...and 3 more figures
