YOLO-World: Real-Time Open-Vocabulary Object Detection

Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

TL;DR

YOLO-World brings open-vocabulary detection to the real-time YOLO framework through vision-language pre-training. It introduces RepVL-PAN, a re-parameterizable fusion network that couples text embeddings with multi-scale image features, and a region-text contrastive pre-training objective for learning cross-modal representations. At inference it follows a prompt-then-detect paradigm: user prompts are encoded once into an offline vocabulary, so the text encoder does not run per image. Zero-shot, the model reaches 35.4 AP on LVIS at 52.0 FPS on a V100. After fine-tuning, it also transfers well to downstream tasks, including object detection and open-vocabulary instance segmentation.

Abstract

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation.
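
The region-text contrastive loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification not stated in this summary, that region-text similarity is a scaled dot product of L2-normalized embeddings and that the loss is a cross-entropy over the vocabulary for each region's assigned text. The function names, the `scale` and `shift` parameters, and the toy vectors are all hypothetical; the paper's exact formulation may differ.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def similarity(region_emb, text_emb, scale=1.0, shift=0.0):
    """Assumed similarity: scaled cosine between a region and a text embedding."""
    r = l2_normalize(region_emb)
    t = l2_normalize(text_emb)
    return scale * sum(a * b for a, b in zip(r, t)) + shift

def region_text_contrastive_loss(region_embs, text_embs, labels, scale=10.0):
    """Mean cross-entropy of each region against all texts in the vocabulary.

    labels[i] is the index of the text assigned to region i (hypothetical
    assignment; a real detector would obtain it from label assignment).
    """
    total = 0.0
    for region, label in zip(region_embs, labels):
        logits = [similarity(region, text, scale) for text in text_embs]
        # Numerically stable log-sum-exp over the vocabulary.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[label]
    return total / len(region_embs)

# Toy data: two "texts" and two regions, each region close to one text.
texts = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.2, 0.8]]
aligned_loss = region_text_contrastive_loss(regions, texts, labels=[0, 1])
swapped_loss = region_text_contrastive_loss(regions, texts, labels=[1, 0])
print(aligned_loss < swapped_loss)
```

Under these assumptions, the loss is small when each region embedding points toward its assigned text embedding and grows when the assignments are swapped, which is the behavior a contrastive objective is meant to reward.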

Paper Structure (49 sections, 7 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Speed-and-Accuracy Curve. We compare YOLO-World with recent open-vocabulary methods in terms of speed and accuracy. All models are evaluated on the LVIS minival and inference speeds are measured on one NVIDIA V100 w/o TensorRT. The size of the circle represents the model's size.
  • Figure 2: Comparison with Detection Paradigms. (a) Traditional Object Detector: These object detectors can only detect objects within the fixed vocabulary pre-defined by the training datasets, e.g., the 80 categories of the COCO dataset. The fixed vocabulary limits the extension to open scenes. (b) Previous Open-Vocabulary Detectors: Previous methods tend to develop large and heavy detectors for open-vocabulary detection which intuitively have strong capacity. In addition, these detectors simultaneously encode images and texts as input for prediction, which is time-consuming for practical applications. (c) YOLO-World: We demonstrate the strong open-vocabulary performance of lightweight detectors, e.g., YOLO detectors, which is of great significance for real-world applications. Rather than using an online vocabulary, we present a prompt-then-detect paradigm for efficient inference, in which the user generates a series of prompts according to the need and the prompts will be encoded into an offline vocabulary. Then it can be re-parameterized as the model weights for deployment and further acceleration.
  • Figure 3: Overall Architecture of YOLO-World. Compared to traditional YOLO detectors, YOLO-World as an open-vocabulary detector adopts text as input. The Text Encoder first encodes the input text into text embeddings. Then the Image Encoder encodes the input image into multi-scale image features and the proposed RepVL-PAN exploits the multi-level cross-modality fusion for both image and text features. Finally, YOLO-World predicts the regressed bounding boxes and the object embeddings for matching the categories or nouns that appeared in the input text.
  • Figure 4: Illustration of the RepVL-PAN. The proposed RepVL-PAN adopts the Text-guided CSPLayer (T-CSPLayer) for injecting language information into image features and the Image Pooling Attention (I-Pooling Attention) for enhancing image-aware text embeddings.
  • Figure 5: Visualization Results on Zero-shot Inference on LVIS. We adopt the pre-trained YOLO-World-L and infer with the LVIS vocabulary (containing 1203 categories) on the COCO val2017.
  • ...and 3 more figures
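
The prompt-then-detect paradigm from Figure 2(c) and the embedding matching from Figure 3 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `OfflineVocabulary`, `toy_encode_text`, the lookup table, and the cosine-similarity threshold are all hypothetical stand-ins. A real system would use a pre-trained text encoder and the detector's predicted object embeddings.

```python
import math

def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class OfflineVocabulary:
    """Encodes prompts once, then reuses the cached embeddings for every image."""

    def __init__(self, prompts, encode_text):
        self.prompts = list(prompts)
        # The text encoder runs here only, not during detection.
        self.embeddings = [_normalize(encode_text(p)) for p in self.prompts]

    def classify(self, object_embedding, threshold=0.5):
        """Return (prompt, score) for the best match, or (None, score) if below threshold."""
        obj = _normalize(object_embedding)
        scores = [sum(a * b for a, b in zip(obj, emb)) for emb in self.embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            return None, scores[best]
        return self.prompts[best], scores[best]

# Hypothetical stand-in for a text encoder: a fixed lookup table.
TOY_TABLE = {
    "dog": [1.0, 0.0, 0.0],
    "red car": [0.0, 1.0, 0.0],
    "traffic cone": [0.0, 0.0, 1.0],
}
encoder_calls = {"count": 0}

def toy_encode_text(prompt):
    encoder_calls["count"] += 1
    return TOY_TABLE[prompt]

# Build the vocabulary once (offline), then match object embeddings (online).
vocab = OfflineVocabulary(TOY_TABLE, toy_encode_text)
label, score = vocab.classify([0.1, 0.9, 0.05])
print(label, encoder_calls["count"])
```

In this sketch the encoder call count stays fixed after the vocabulary is built, no matter how many object embeddings are classified. That is the efficiency argument behind using an offline rather than an online vocabulary. The re-parameterization of the vocabulary into model weights described in the caption is not modeled here.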