COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Jiansheng Li, Xingxuan Zhang, Hao Zou, Yige Guo, Renzhe Xu, Yilong Liu, Chuzhao Zhu, Yue He, Peng Cui
TL;DR
COUNTS addresses a critical gap in OOD generalization research by providing a large-scale, finely annotated dataset tailored for both object detection and grounding under natural distribution shifts. It introduces two benchmarks, $O(OD)^{2}$ and OODG, to systematically evaluate OOD performance in object detectors and multimodal large language models (MLLMs), respectively. Across extensive experiments, the study finds that while model scale and pretraining improve in-distribution (IID) performance, these gains do not reliably transfer to OOD settings. Grounding models are substantially susceptible to distribution shifts in in-context learning, and large models such as GPT-4o and Gemini differ both in how much they rely on in-context examples (ICE) and in how robust they are. The COUNTS benchmarks and findings point to concrete directions for developing detectors and grounding-capable MLLMs that maintain high performance under distribution shifts in real-world deployment.
Abstract
Current object detectors often suffer significant performance degradation in real-world applications when encountering distributional shifts. Consequently, the out-of-distribution (OOD) generalization capability of object detectors has garnered increasing attention from researchers. Despite this growing interest, there remains a lack of a large-scale, comprehensive dataset and evaluation benchmark with fine-grained annotations tailored to assess the OOD generalization on more intricate tasks like object detection and grounding. To address this gap, we introduce COUNTS, a large-scale OOD dataset with object-level annotations. COUNTS encompasses 14 natural distributional shifts, over 222K samples, and more than 1,196K labeled bounding boxes. Leveraging COUNTS, we introduce two novel benchmarks: $O(OD)^{2}$ and OODG. $O(OD)^{2}$ is designed to comprehensively evaluate the OOD generalization capabilities of object detectors by utilizing controlled distribution shifts between training and testing data. OODG, on the other hand, aims to assess the OOD generalization of grounding abilities in multimodal large language models (MLLMs). Our findings reveal that, while large models and extensive pre-training data substantially enhance performance in in-distribution (IID) scenarios, significant limitations and opportunities for improvement persist in OOD contexts for both object detectors and MLLMs. In visual grounding tasks, even the advanced GPT-4o and Gemini-1.5 only achieve 56.7% and 28.0% accuracy, respectively. We hope COUNTS facilitates advancements in the development and assessment of robust object detectors and MLLMs capable of maintaining high performance under distributional shifts.
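The idea of a controlled distribution shift between training and testing data can be illustrated with a minimal sketch that holds out entire domains for testing. This is not the benchmark's actual protocol or code: the function name `split_by_domain`, the `domain` field, and the example domain names are all hypothetical and chosen only for illustration.

```python
def split_by_domain(samples, test_domains):
    """Split samples so that held-out domains appear only in the test set.

    `samples` is a list of dicts with a hypothetical "domain" key;
    `test_domains` is an iterable of domain names reserved for testing.
    """
    held_out = set(test_domains)
    train, test = [], []
    for sample in samples:
        # A sample goes to the test split only if its domain is held out,
        # so the training split never sees the test-time distribution.
        if sample["domain"] in held_out:
            test.append(sample)
        else:
            train.append(sample)
    return train, test


# Hypothetical toy data: image ids tagged with an illustrative domain name.
samples = [
    {"image_id": 0, "domain": "domain_a"},
    {"image_id": 1, "domain": "domain_b"},
    {"image_id": 2, "domain": "domain_c"},
    {"image_id": 3, "domain": "domain_c"},
]
train, test = split_by_domain(samples, test_domains=["domain_c"])
```

Under this kind of split, a detector evaluated on `test` faces domains it never saw during training, which is one simple way to separate IID performance from OOD performance.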
