YOLO-UniOW: Efficient Universal Open-World Object Detection

Lihao Liu, Juexiao Feng, Hui Chen, Ao Wang, Lin Song, Jungong Han, Guiguang Ding

TL;DR

The paper tackles the need for detectors that operate in open-world settings while accommodating open-vocabulary categories, all under real-time constraints. It introduces the Uni-OWD task and the YOLO-UniOW framework, which has two main components: Efficient Adaptive Decision Learning (AdaDL), which aligns image and text representations directly in the CLIP latent space, and a Wildcard Learning mechanism, which detects unknown objects and enables vocabulary expansion without incremental learning. Key contributions include formalizing Uni-OWD, developing AdaDL with low-rank text-encoder adaptation and dual-head matching, and implementing Wildcard Learning to label unknowns while preserving known-category accuracy. Empirically, it reports strong zero-shot and open-world results, e.g., 34.6 $AP$ and 30.0 $AP_r$ on LVIS at 69.6 FPS, and strong OWOD performance on M-OWODB, S-OWODB, and nuScenes. This approach offers a practical, scalable path toward universal open-world detection suitable for real-time applications and edge devices, with dynamic vocabulary expansion and robust unknown-object recall.
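To make "low-rank text-encoder adaptation" concrete, here is a minimal sketch of a generic low-rank weight update in the style of LoRA: a frozen weight matrix W is augmented with the product of two small matrices B and A. This is an illustration only. The function names, the `scale` parameter, and the toy matrices are assumptions, and the summary does not state that the paper's formulation matches this one exactly.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def low_rank_adapted_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A) as a new matrix.

    W stays untouched (frozen); only the small factors B and A would be
    trained. `scale` is a hypothetical knob, not a value from the paper.
    """
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy example (made-up numbers): a 3x3 frozen weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # shape (3, 1)
A = [[0.5, 0.0, -0.5]]      # shape (1, 3)
W_adapted = low_rank_adapted_weight(W, B, A)
```

In this toy case the update adds 3 + 3 = 6 trainable numbers instead of the 9 in W; the saving grows quickly for realistic layer sizes.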

Abstract

Traditional object detection models are constrained by the limitations of closed-set datasets, detecting only categories encountered during training. While multimodal models have extended category recognition by aligning text and image modalities, they introduce significant inference overhead due to cross-modality fusion and still remain restricted by predefined vocabulary, leaving them ineffective at handling unknown objects in open-world scenarios. In this work, we introduce Universal Open-World Object Detection (Uni-OWD), a new paradigm that unifies open-vocabulary and open-world object detection tasks. To address the challenges of this setting, we propose YOLO-UniOW, a novel model that advances the boundaries of efficiency, versatility, and performance. YOLO-UniOW incorporates Adaptive Decision Learning to replace computationally expensive cross-modality fusion with lightweight alignment in the CLIP latent space, achieving efficient detection without compromising generalization. Additionally, we design a Wildcard Learning strategy that detects out-of-distribution objects as "unknown" while enabling dynamic vocabulary expansion without the need for incremental learning. This design empowers YOLO-UniOW to seamlessly adapt to new categories in open-world environments. Extensive experiments validate the superiority of YOLO-UniOW, achieving 34.6 AP and 30.0 APr on LVIS with an inference speed of 69.6 FPS. The model also sets benchmarks on M-OWODB, S-OWODB, and nuScenes datasets, showcasing its unmatched performance in open-world object detection. Code and models are available at https://github.com/THU-MIG/YOLO-UniOW.
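The idea of matching in a shared latent space instead of fusing modalities can be sketched as follows: each region embedding is scored against each class text embedding by cosine similarity, so adding a category only means adding one more text embedding. This is a toy illustration, not the YOLO-UniOW code. The three-dimensional vectors stand in for CLIP embeddings, and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_scores(region_embedding, text_embeddings):
    """Cosine similarity between one region embedding and every class embedding."""
    r = l2_normalize(region_embedding)
    return {
        name: sum(a * b for a, b in zip(r, l2_normalize(t)))
        for name, t in text_embeddings.items()
    }


def classify_region(region_embedding, text_embeddings):
    """Return the best-matching class name and its similarity score."""
    scores = cosine_scores(region_embedding, text_embeddings)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy vocabulary: made-up vectors standing in for text-encoder outputs.
vocab = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
label, score = classify_region([0.9, 0.1, 0.0], vocab)

# Vocabulary expansion: add one more text embedding; no detector weights change.
vocab["bicycle"] = [0.0, 0.0, 1.0]
label_after_expansion, _ = classify_region([0.1, 0.0, 0.95], vocab)
```

Because the only cross-modal operation is a dot product, the text embeddings can be computed once and cached, which is one plausible reading of why this design avoids the inference overhead attributed to fusion-based detectors.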

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Speed-Accuracy Trade-off Curve. Comparison of YOLO-UniOW and recent methods in speed and accuracy on the LVIS minival dataset. Inference speed is measured on a single NVIDIA V100 GPU without TensorRT. Circle size indicates model size.
  • Figure 2: Comparison of Detection Frameworks. (a) Open-vocabulary detector with cross-modality fusion. (b) Our efficient open-vocabulary detector with Adaptive Decision Learning. (c) Open-world and open-vocabulary detectors. (d) Our Uni-OWD detector for both open-vocabulary and open-world tasks.
  • Figure 3: Our Proposed Efficient Universal Open-World Object Detection Pipeline. Open-Vocabulary Pretraining (left): a Multimodal Dual-Head Match is used for efficient end-to-end object detection, and AdaDL is applied in the text encoder for adaptive decision boundary learning. Open-World Fine-tuning (right): calibrated text embeddings and the detector are used to adaptively detect both known and unknown objects with the assistance of the wildcard. A filtering strategy is employed to remove duplicate unknown predictions, ensuring efficient and effective open-world object detection.
  • Figure 4: The Process of Known/Wildcard Learning. The text embeddings for previously known classes are frozen, while the embeddings for currently known classes are fine-tuned using ground-truth labels. The "unknown" wildcard is supervised by pseudo labels generated from the predictions of the well-tuned wildcard. The figure shows the well-tuned wildcard's prediction scores; boxes with low confidence scores or with high IoU against known-class ground truth (shown dashed) are filtered out.
  • Figure 5: Visualization Results of Zero-shot Inference on LVIS. We present visualization results alongside YOLO-Worldv2, both at the small scale, using LVIS 1023 class names as text prompts. The model pretrained with our strategy demonstrates exceptional capability in detecting objects within complex scenes and recognizing a broader range of novel classes.
  • ...and 1 more figure
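The filtering rule described in the Figure 4 caption (discard wildcard predictions that have low confidence or that overlap heavily with known-class ground truth) can be sketched as below. This is a minimal illustration under stated assumptions: the `(x1, y1, x2, y2)` box format, the function names, and both threshold values are hypothetical and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def filter_wildcard_predictions(predictions, known_gt_boxes,
                                score_thr=0.5, iou_thr=0.5):
    """Keep wildcard boxes usable as "unknown" pseudo labels.

    A box is dropped if its score is below `score_thr`, or if its IoU with
    any known-class ground-truth box exceeds `iou_thr`.
    Both thresholds are placeholder values (assumptions).
    """
    kept = []
    for box, score in predictions:
        if score < score_thr:
            continue  # low confidence
        if any(iou(box, gt) > iou_thr for gt in known_gt_boxes):
            continue  # likely a known object that is already labeled
        kept.append((box, score))
    return kept


# Toy data: one known ground-truth box and three wildcard predictions.
known_gt = [(0, 0, 10, 10)]
preds = [
    ((1, 1, 10, 10), 0.9),    # high IoU with the known box: dropped
    ((20, 20, 30, 30), 0.8),  # confident and non-overlapping: kept
    ((40, 40, 50, 50), 0.2),  # low confidence: dropped
]
pseudo_labels = filter_wildcard_predictions(preds, known_gt)
```

Only the second prediction survives here, so it alone would serve as a pseudo label for the "unknown" wildcard in this toy setup.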