Open-Vocabulary Object Detectors: Robustness Challenges under Distribution Shifts

Prakash Chandra Chhipa; Kanjar De; Meenakshi Subhash Chippa; Rajkumar Saini; Marcus Liwicki

TL;DR

This study presents a comprehensive robustness evaluation of the zero-shot capabilities of three recent open-vocabulary (OV) foundation object detection models: OWL-ViT, YOLO World, and Grounding DINO.

Abstract

Out-of-distribution (OOD) robustness remains a critical hurdle to deploying deep vision models. Vision-Language Models (VLMs) have recently achieved groundbreaking results. VLM-based open-vocabulary object detection extends the capabilities of traditional object detection frameworks, enabling the recognition and classification of objects beyond predefined categories. Investigating the OOD robustness of recent open-vocabulary object detectors is essential to increase the trustworthiness of these models. This study presents a comprehensive robustness evaluation of the zero-shot capabilities of three recent open-vocabulary (OV) foundation object detection models: OWL-ViT, YOLO World, and Grounding DINO. Experiments were carried out on the robustness benchmarks COCO-O, COCO-DC, and COCO-C, which encompass distribution shifts due to information loss, corruption, adversarial attacks, and geometrical deformation. The results highlight the challenges to the models' robustness and are intended to foster research toward achieving robustness. Project page: https://prakashchhipa.github.io/projects/ovod_robustness

Paper Structure (16 sections, 1 equation, 7 figures, 7 tables)

Introduction
Related Work
Out-of-Distribution benchmarks
COCO-O
COCO-DC
COCO-C
Open-Vocabulary Object detectors Models
OWL-ViT
YOLO-World
Grounding DINO
Experiments and Results
Evaluation Method
Metrics
Discussions
Conclusion
...and 1 more sections

Figures (7)

Figure 1: Zero-shot performance comparison of the open-vocabulary object detection models OWL-ViT [minderer2022simple] (ECCV'22), YOLO World [cheng2024yolow] (CVPR'24), and Grounding DINO [liu2023grounding] (ECCV'24). COCO-O [mao2023coco] (ICCV'23) results are averaged over its six subsets, and COCO-C [michaelis2019benchmarking] results are averaged over fifteen corruptions.
Figure 2: Zero-shot performance on COCO-DC. (Left) Robustness of OWL-ViT, YOLO World, and Grounding DINO on the original subset versus the adversarial subset. (Right) Robustness of the same models on the original subset versus the average of all remaining subsets.
Figure 3: Examples from six COCO-O benchmark subsets depicted with predictions by open-vocabulary models: OWL-ViT, YOLO World, and Grounding DINO. The input textual query includes the object categories identified in the labels.
Figure 4: Zero-shot evaluation process for open-vocabulary object detection models.
Figure 5: Comparisons of effective robustness for detectors based on their performance on original COCO and COCO-O datasets.
...and 2 more figures
