Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024
Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, Boning Wang, Yansong Peng, Hebei Li
TL;DR
The paper investigates large-vocabulary object detection on V3Det, presenting a baseline-driven methodology and a set of architectural and loss-function improvements to address complex category labels. It introduces a PA-FPN-based adjustment to enhance shallow feature propagation, and applies DIoU and Generalized Focal Loss to improve localization and sampling balance, alongside training refinements. While these changes yield some gains over the initial baselines, they do not surpass the EVA-CLIP/MIM-based approach, and the authors observe stronger results when leveraging pretrained models and training strategies typical of large-model regimes. In Open Vocabulary Detection (OVD) Track II, retraining a model on the base categories still improves novel-class detection, which suggests that the rich semantic information in V3Det supports generalization. The study underscores that in the era of large models, advanced pretraining and training protocols are crucial for effective vast-vocabulary detection, offering practical insights for future work.
Abstract
In this technical report, we present our findings on the Vast Vocabulary Visual Detection (V3Det) dataset for the Supervised Vast Vocabulary Visual Detection task. The main difficulty in this track is handling complex categories and detection boxes, for which the original supervised detector is not well suited. We designed a series of improvements, including adjustments to the network structure, changes to the loss function, and new training strategies. Our model improves over the baseline and achieved excellent rankings on the leaderboard for both the Vast Vocabulary Object Detection (Supervised) track and the Open Vocabulary Object Detection (OVD) track of the V3Det Challenge 2024.
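One of the loss-function changes named above is DIoU (Distance-IoU). As a rough illustration of what that loss computes, the following minimal sketch evaluates the standard DIoU loss, 1 - IoU + ρ²/c², for a single pair of axis-aligned boxes, where ρ is the distance between box centers and c is the diagonal of the smallest enclosing box. The function name `diou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for this example; it is not the authors' implementation, which would operate on batched tensors inside a detection framework.

```python
def diou_loss(box_a, box_b, eps=1e-9):
    """Illustrative DIoU loss for two boxes given as (x1, y1, x2, y2).

    Hypothetical helper for exposition only; not taken from the report.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)

    # Squared distance between the two box centers (rho^2).
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both (c^2).
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    c2 = enc_w ** 2 + enc_h ** 2

    # DIoU loss: IoU term plus a normalized center-distance penalty.
    return 1.0 - iou + rho2 / (c2 + eps)
```

Unlike a plain IoU loss, which is constant at 1 for any pair of non-overlapping boxes, the center-distance term keeps growing as the boxes move apart, so the loss still distinguishes near misses from far ones.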
