
CerberusDet: Unified Multi-Dataset Object Detection

Irina Tolstykh, Mikhail Chernyshov, Maksim Kuprashevich

TL;DR

CerberusDet is a framework with a multi-headed model for handling multiple object detection tasks. Built on the YOLO architecture, it efficiently shares visual features from the backbone and neck components while maintaining separate task heads.

Abstract

Conventional object detection models are usually limited by the data on which they were trained and by the category logic they define. With the recent rise of Language-Visual Models, new methods have emerged that are not restricted to these fixed categories. Despite their flexibility, such Open Vocabulary detection models still fall short in accuracy compared to traditional models with fixed classes. At the same time, more accurate data-specific models face challenges when classes need to be extended or different datasets need to be merged for training. Such datasets often cannot be combined because of differing labeling logic or conflicting class definitions, making it difficult to improve a model without compromising its performance. In this paper, we introduce CerberusDet, a framework with a multi-headed model designed for handling multiple object detection tasks. The proposed model is built on the YOLO architecture and efficiently shares visual features from both backbone and neck components, while maintaining separate task heads. This approach allows CerberusDet to perform very efficiently while still delivering optimal results. We evaluated the model on the PASCAL VOC dataset and the Objects365 dataset to demonstrate its abilities. CerberusDet achieved state-of-the-art results with 36% less inference time. The more tasks are trained together, the more efficient the proposed model becomes compared to running individual models sequentially. The training and inference code, as well as the model, are available as open source (https://github.com/ai-forever/CerberusDet).

Paper Structure (13 sections, 3 equations, 4 figures, 2 tables, 1 algorithm)


Figures (4)

  • Figure 1: Illustration of CerberusDet trained on three datasets with different labels. We trained a model using the PASCAL VOC dataset and two subsets of the Objects365 dataset covering the animals and tableware categories. See training details in the experiments section.
  • Figure 2: Diagram of an example of the CerberusDet architecture based on YOLOv8, illustrated with three tasks. Each neck module can be shared between tasks or be task-specific. The CerberusDet model optimizes computational resources by sharing all backbone parameters across tasks, while each task retains its own unique set of parameters for the head.
  • Figure 3: Comparison of inference time of YOLOv8x-based CerberusDet models and the sequence of individual models. Measurements were made with FP16 precision on a V100 GPU with a batch size of 32.
  • Figure 4: $rsa\ score$, $computational\ score$ and average mAP of 4 different models trained for 3 tasks.
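
The sharing scheme described in the Figure 2 caption can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: `ToyModule`, `MultiHeadDetector`, and the scale factors are hypothetical stand-ins, and real backbone, neck, and head blocks are replaced by simple list scaling. The task names follow the three datasets mentioned in Figure 1. The sketch only shows the control flow: the backbone and shared neck run once per image, an optional task-specific neck runs for tasks that have one, and each task has its own head.

```python
from typing import Dict, List

Features = List[float]


class ToyModule:
    """Hypothetical stand-in for a network block: scales its input and counts calls."""

    def __init__(self, name: str, scale: float) -> None:
        self.name = name
        self.scale = scale
        self.calls = 0  # how many times this block has been executed

    def __call__(self, x: Features) -> Features:
        self.calls += 1
        return [v * self.scale for v in x]


class MultiHeadDetector:
    """Shared backbone and neck, optional per-task neck, one head per task."""

    def __init__(
        self,
        backbone: ToyModule,
        shared_neck: ToyModule,
        task_necks: Dict[str, ToyModule],
        heads: Dict[str, ToyModule],
    ) -> None:
        self.backbone = backbone
        self.shared_neck = shared_neck
        self.task_necks = task_necks  # only tasks with a task-specific neck appear here
        self.heads = heads

    def forward(self, image: Features) -> Dict[str, Features]:
        # Shared computation happens once, regardless of the number of tasks.
        shared = self.shared_neck(self.backbone(image))
        outputs: Dict[str, Features] = {}
        for task, head in self.heads.items():
            neck = self.task_necks.get(task)
            feats = neck(shared) if neck is not None else shared
            outputs[task] = head(feats)
        return outputs


# Assumed toy configuration: three tasks, one of which has its own neck block.
backbone = ToyModule("backbone", 2.0)
shared_neck = ToyModule("shared_neck", 0.5)
task_necks = {"animals": ToyModule("animals_neck", 3.0)}
heads = {
    "voc": ToyModule("voc_head", 1.0),
    "animals": ToyModule("animals_head", 10.0),
    "tableware": ToyModule("tableware_head", 100.0),
}

model = MultiHeadDetector(backbone, shared_neck, task_necks, heads)
outputs = model.forward([1.0, 2.0])
print(outputs)
print("backbone calls:", backbone.calls)
```

After one forward pass the backbone has been called once, whereas three separate single-task models would each run their own backbone on the same image. This is the source of the savings that grow with the number of tasks.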