Plain-Det: A Plain Multi-Dataset Object Detector
Cheng Shi, Yuchen Zhu, Sibei Yang
TL;DR
Plain-Det is a simple framework for multi-dataset object detection that maintains dataset-specific classification heads, employs a class-aware query compositor, and applies a hardness-indicated sampling strategy. Integrated with Def-DETR, it reaches 51.9 mAP on COCO, on par with state-of-the-art detectors, and generalizes well across 13 downstream datasets while improving training efficiency. The approach addresses taxonomy conflicts between datasets, uses a shared semantic label space derived from CLIP, and balances datasets dynamically during training, yielding gains over prior multi-dataset detectors. However, it relies on CLIP-derived label embeddings, which may carry over biases from the training data of vision-language models.
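To make the idea of dataset-specific classification heads over a shared label space concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: `toy_text_embedding` is a hypothetical, deterministic stand-in for a CLIP text encoder, and `DatasetHead`, the dataset names, and the label lists are illustrative assumptions. The point is only that each dataset keeps its own label list (so conflicting taxonomies are never merged), while every label is embedded by the same function into one space.

```python
import math
import random

EMBED_DIM = 8  # toy dimensionality; real text embeddings are much larger


def toy_text_embedding(label, dim=EMBED_DIM):
    """Hypothetical stand-in for a text encoder such as CLIP.

    Seeds a private RNG with the label string, so the same label always
    maps to the same unit vector, regardless of which dataset uses it.
    """
    rng = random.Random(label)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


class DatasetHead:
    """One classification head per dataset (illustrative assumption).

    The head's weights are the embeddings of that dataset's own labels,
    so label sets are kept separate instead of being unified by hand.
    """

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = [toy_text_embedding(name) for name in self.labels]

    def logits(self, feature):
        # Dot product between the object feature and each label embedding.
        return [sum(w * f for w, f in zip(row, feature)) for row in self.weights]

    def predict(self, feature):
        scores = self.logits(feature)
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.labels[best]


# Two datasets with overlapping but different taxonomies (made-up examples).
heads = {
    "dataset_a": DatasetHead(["person", "dog", "car"]),
    "dataset_b": DatasetHead(["pedestrian", "dog", "vehicle", "bicycle"]),
}

# A toy object feature that happens to coincide with the "dog" embedding.
feature = toy_text_embedding("dog")
for name, head in heads.items():
    print(name, len(head.logits(feature)), head.predict(feature))
```

Each head emits one logit per label of its own dataset, so adding a new dataset in this sketch only means constructing another `DatasetHead`; the existing heads are left untouched.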
Abstract
Recent advancements in large-scale foundational models have sparked widespread interest in training highly proficient large vision models. A common consensus revolves around the necessity of aggregating extensive, high-quality annotated data. However, given the inherent challenges in annotating dense tasks in computer vision, such as object detection and segmentation, a practical strategy is to combine and leverage all available data for training purposes. In this work, we propose Plain-Det, which offers flexibility to accommodate new datasets, robustness in performance across diverse datasets, training efficiency, and compatibility with various detection architectures. We utilize Def-DETR, with the assistance of Plain-Det, to achieve an mAP of 51.9 on COCO, matching the current state-of-the-art detectors. We conduct extensive experiments on 13 downstream datasets, and Plain-Det demonstrates strong generalization capability. Code is released at https://github.com/ChengShiest/Plain-Det
