Plain-Det: A Plain Multi-Dataset Object Detector

Cheng Shi, Yuchen Zhu, Sibei Yang

TL;DR

Plain-Det introduces a simple yet effective framework for multi-dataset object detection built on three components: dataset-specific classification heads, a class-aware query compositor, and a hardness-indicated sampling strategy. Integrated with Def-DETR, it reaches 51.9 mAP on COCO, matching current state-of-the-art detectors, and generalizes well across 13 downstream datasets while improving training efficiency. The approach addresses taxonomy conflicts between datasets, uses a shared semantic label space derived from CLIP, and balances datasets dynamically during training, yielding notable gains over prior multi-dataset detectors. However, it relies on CLIP-derived label embeddings, which may carry biases inherent in the training data of vision-language models.
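To make the idea of dynamic dataset balancing concrete, here is a minimal sketch in Python. It is not the paper's formula: it assumes that "hardness" can be approximated by a running training loss per dataset and that harder datasets should be sampled more often. The function names, the `temperature` parameter, and the loss values are all hypothetical.

```python
import random

def sampling_weights(recent_losses, temperature=1.0):
    """Turn per-dataset running losses into sampling probabilities.

    Assumption (not from the paper): a higher running loss means a
    harder dataset, which should be sampled more often.
    """
    powered = {name: loss ** (1.0 / temperature)
               for name, loss in recent_losses.items()}
    total = sum(powered.values())
    return {name: value / total for name, value in powered.items()}

def pick_dataset(weights, rng):
    """Draw the dataset that supplies the next training batch."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Made-up running losses, for illustration only.
losses = {"COCO": 1.0, "LVIS": 3.0, "O365": 1.0}
weights = sampling_weights(losses)

rng = random.Random(0)
next_dataset = pick_dataset(weights, rng)
```

In a training loop, the weights would be recomputed periodically as the running losses change, so that the mix of datasets follows the current difficulty instead of a fixed ratio.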

Abstract

Recent advancements in large-scale foundational models have sparked widespread interest in training highly proficient large vision models. A common consensus revolves around the necessity of aggregating extensive, high-quality annotated data. However, given the inherent challenges in annotating dense tasks in computer vision, such as object detection and segmentation, a practical strategy is to combine and leverage all available data for training purposes. In this work, we propose Plain-Det, which offers flexibility to accommodate new datasets, robustness in performance across diverse datasets, training efficiency, and compatibility with various detection architectures. We utilize Def-DETR, with the assistance of Plain-Det, to achieve an mAP of 51.9 on COCO, matching the current state-of-the-art detectors. We conduct extensive experiments on 13 downstream datasets, and Plain-Det demonstrates strong generalization capability. Code is released at https://github.com/ChengShiest/Plain-Det

Paper Structure

This paper contains 15 sections, 9 equations, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: The benefits and challenges of multi-dataset object detection. (a) Various datasets span diverse taxonomies and data distributions. (b) Semantic space calibration. (c) Our approach leverages the advantages of training across multiple datasets to achieve performance enhancements through scaling up data volume.
  • Figure 2: The insights for sparse proposal generation and emergent property. (a) Difference between dense proposal generation and sparse proposal generation. (b) Analysis of two types of proposal generation under multi-dataset object detection training. (c) The emergent property in multi-dataset training. The detector trained on COCO+O365+LVIS shows unstable performance on LVIS.
  • Figure 3: Method overview. Our multi-dataset detector Plain-Det is compatible with various query-based detection families. (a) Our multi-dataset joint training framework for object detection. (b) Overview of the query compositor: it takes images and the label embeddings of the datasets as inputs and outputs class-aware queries.
  • Figure 4: Comparison of different proposal generation methods and ours. (a) Proposal generation from sparse queries. (b) Proposal generation from top-K dataset-specific pixel features in the dense image feature map. (c) Our class-aware query generation relies on weak priors associated with the dataset and the image. "Dataset-specific head" indicates that a different frozen classification head is used for each dataset when computing the loss.
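
The frozen, dataset-specific classification heads mentioned in the Figure 3 and Figure 4 captions can be illustrated with a small sketch. It assumes each head is a fixed matrix of label embeddings (tiny hand-written vectors stand in for CLIP text embeddings here), that logits are cosine similarities between a query feature and those embeddings, and that the loss is a plain softmax cross-entropy. The class `FrozenHead`, the dataset names, and the loss choice are hypothetical and are not taken from the Plain-Det code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

class FrozenHead:
    """Classification head whose weights are fixed label embeddings."""

    def __init__(self, class_names, label_embeddings):
        self.class_names = list(class_names)
        # Stand-ins for CLIP text embeddings; never updated during training.
        self.weights = [normalize(e) for e in label_embeddings]

    def logits(self, query_feature):
        q = normalize(query_feature)
        return [dot(q, w) for w in self.weights]

def cross_entropy(logits, target_index):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target_index]

# One head per dataset, each with its own taxonomy (hypothetical data).
heads = {
    "dataset_a": FrozenHead(["person", "car"],
                            [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    "dataset_b": FrozenHead(["human", "vehicle", "dog"],
                            [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]),
}

def classification_loss(dataset_name, query_feature, target_class):
    """Score a query only against the taxonomy of its source dataset."""
    head = heads[dataset_name]
    target_index = head.class_names.index(target_class)
    return cross_entropy(head.logits(query_feature), target_index)

query = [0.8, 0.2, 0.1]
loss_a = classification_loss("dataset_a", query, "person")
loss_b = classification_loss("dataset_b", query, "human")
```

Because each image is scored only against the label set of the dataset it came from, overlapping or conflicting class names in other datasets never enter its loss, which is one simple way to sidestep taxonomy conflicts without merging label spaces by hand.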