Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, Boning Wang, Yansong Peng, Hebei Li

TL;DR

The paper investigates large-vocabulary object detection on V3Det, presenting a baseline-driven methodology and a set of architectural and loss-function improvements to address complex category labels. It introduces a PA-FPN-based adjustment to enhance shallow feature propagation, and applies DIoU and Generalized Focal Loss to improve localization and sampling balance, alongside training refinements. While these changes yield some gains over the initial baselines, they do not surpass the EVA-CLIP/MIM-based approach, and the authors observe stronger results when leveraging pretrained models and training strategies typical of large-model regimes. In Open Vocabulary Detection (OVD) Track II, retraining a base-class model on base categories still improves novel-class detection, indicating rich semantic information in V3Det supports generalization. The study underscores that in the era of large models, advanced pretraining and training protocols are crucial for effective vast-vocabulary detection, offering practical insights for future work.
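As a rough illustration of the DIoU term mentioned above, the sketch below computes the commonly used Distance-IoU loss, $1 - \mathrm{IoU} + \rho^2 / c^2$, for two axis-aligned boxes, where $\rho$ is the distance between box centers and $c$ is the diagonal of the smallest enclosing box. The function name `diou_loss` and the `(x1, y1, x2, y2)` box format are assumptions made for this example; the report's own equations are not reproduced in this section, so this may differ in detail from the authors' implementation.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not taken from the report's code.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers (rho^2).
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both (c^2).
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    c2 = enc_w ** 2 + enc_h ** 2

    penalty = rho2 / c2 if c2 > 0 else 0.0
    return 1.0 - iou + penalty
```

Unlike a plain IoU loss, the center-distance penalty keeps growing as non-overlapping boxes move apart, so the loss still distinguishes near misses from far misses when the IoU is zero.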

Abstract

In this technical report, we present our findings from research conducted on the Vast Vocabulary Visual Detection (V3Det) dataset for the Supervised Vast Vocabulary Visual Detection task. Handling complex categories and detection boxes is a central difficulty in this track, and the original supervised detector is not well suited to it. We designed a series of improvements, including adjustments to the network structure, changes to the loss function, and the design of training strategies. Our model improves over the baseline and achieved excellent rankings on the leaderboard for both the Vast Vocabulary Object Detection (Supervised) track and the Open Vocabulary Object Detection (OVD) track of the V3Det Challenge 2024.

Paper Structure (12 sections, 3 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: V3Det is a high-quality, precisely annotated object detection dataset with a broad vocabulary, encompassing 13,204 categories. The figure shows annotated image samples from V3Det, featuring more complex and detailed annotations.
  • Figure 2: Illustration of the PA-FPN structure: an FPN augmented with a bottom-up path from $N_2$ to $N_5$.
  • Figure 3: The horizontal axis represents different classes, and the vertical axis represents the number of samples corresponding to each class, with values above 1000 not displayed.