DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Shilin Xu, Xiangtai Li, Size Wu, Wenwei Zhang, Yunhai Tong, Chen Change Loy

TL;DR

This paper tackles open-vocabulary object detection (OVOD) by addressing the gap between training, where novel objects are treated as background, and testing, where they must be detected. It introduces DST-Det, which uses a dynamically updated Pseudo-Labeling Module (PLM) that leverages CLIP to identify potential novel objects among negative proposals during training, and then uses these pseudo labels to supervise both the RPN and RoIHead within a two-stage detector that employs a frozen CLIP backbone. The method achieves consistent, state-of-the-art gains across LVIS, COCO, and V3Det benchmarks without requiring extra unlabeled data or retraining, and can be paired with strong VLM-based baselines (e.g., CLIPSelf) to reach high novel-class AP figures. The approach offers a practical, plug-in component for improving OVOD performance with minimal inference-time overhead, highlighting the value of integrating vision-language knowledge directly into the training loop.

Abstract

Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work introduces a straightforward and efficient strategy that utilizes pre-trained vision-language models (VLMs), like CLIP, to identify potential novel classes through zero-shot classification. Previous methods use a class-agnostic region proposal network to detect object proposals and consider the proposals that do not match the ground truth as background. Unlike these methods, our method selects a subset of the proposals that would otherwise be considered background and treats them as novel classes during training. We refer to this approach as the self-training strategy, which enhances recall and accuracy for novel classes without requiring extra annotations, datasets, or re-training. Compared to previous pseudo-labeling methods, our approach does not require re-training or offline labeling, which makes it more efficient and effective in one-shot training. Empirical evaluations on three datasets, LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance without incurring additional parameters or computational costs during inference. We also apply our method to various baselines. In particular, compared with the previous method F-VLM, our method achieves a 1.7% improvement on the LVIS dataset. Combined with the recent method CLIPSelf, our method also achieves 46.7 novel class AP on COCO without introducing extra data for pretraining. We also achieve over 6.5% improvement over the F-VLM baseline on the recent challenging V3Det dataset. We release our code and models at https://github.com/xushilin1/dst-det.
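The core idea of the abstract, zero-shot classification of would-be background proposals, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pseudo_label`, the `temperature` and `score_thr` values, and the toy 2-D embeddings are all assumptions chosen for illustration. In practice the region and text embeddings would come from a CLIP image and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(region_embeds, text_embeds, class_names,
                 temperature=0.01, score_thr=0.8):
    """Assign a class name to each negative proposal whose zero-shot
    score is confident enough; the rest stay background.

    temperature and score_thr are hypothetical values, not taken
    from the paper.
    """
    selected = []
    for idx, region in enumerate(region_embeds):
        logits = [cosine(region, t) / temperature for t in text_embeds]
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= score_thr:
            selected.append((idx, class_names[best], probs[best]))
    return selected

# Toy embeddings: proposals 0 and 2 align with one class each,
# proposal 1 is ambiguous and is left as background.
regions = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
names = ["cat", "dog"]
labels = pseudo_label(regions, texts, names)
```

Because the selection runs inside the training loop on each batch's negative proposals, a scheme like this needs no offline labeling pass, which matches the "dynamic" framing of the abstract.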

Paper Structure (12 sections, 3 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Comparison with previous methods that use pseudo labels. (a) Previous methods [pb-ovd, vl-plm, zhao2023improving] usually obtain pseudo labels from extra unlabeled data. They first train a class-agnostic detector to generate object proposals and then classify them using VLMs. (b) Our DST-Det constructs an end-to-end pipeline, generates pseudo labels during training, and does not use any extra data.
  • Figure 2: Illustration of our motivation and framework. (a) Our DST-Det incorporates novel class labels to supervise the detection head during training. (b) Experiments on OV-COCO and OV-LVIS using CLIP with ground-truth boxes for zero-shot classification. We observe high top-1 and top-5 accuracy in classifying novel classes. (c) Illustration of our dynamic self-training pipeline with the pseudo labels.
  • Figure 3: Illustration of DST framework. (a) The meta-architecture of DST-Det includes the proposed pseudo-labeling module (PLM), which is integrated into two stages of the detector: RPN and RoIHead. (b) The proposed pseudo-labeling module (PLM). During training, PLM takes the top-level feature map from the image encoder and text embedding of object classes as input and generates the pseudo labels for the RPN and the RoIHead. (c) The process of extracting CLIP representation for region proposals. The RoIAlign operation is applied to the top-level feature map, the output of which is then pooled by the Attention Pooling layer (AttnPooling) of the CLIP image encoder.
  • Figure 4: Visual Analysis of DST framework. (a) We present a t-SNE analysis on the novel region embeddings during training. Different colors represent different classes. We find that using fewer training samples works well. (b) We show visual improvements over the strong baseline. Our method can detect and segment novel classes, as shown on the right side of the data pair.
  • Figure 5: Pseudo Label Visual Examples. Left: We visualize the class-agnostic ground truth bounding boxes. The green boxes represent the ground truth of base classes and are used as foreground supervision, while the red boxes represent the ground truth of possible novel classes, which may not be used during training. Right: The red boxes represent the pseudo labels we selected from negative proposals.
  • ...and 2 more figures
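
The captions above refer to "negative proposals", that is, proposals that do not match any base-class ground-truth box. A minimal sketch of how such a split is commonly made with an IoU threshold follows. The function names and the `pos_iou=0.5` threshold are assumptions for illustration, not values taken from the paper; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def split_proposals(proposals, gt_boxes, pos_iou=0.5):
    """Return (positive_indices, negative_indices).

    A proposal is positive if it overlaps some base-class ground-truth
    box by at least pos_iou (a hypothetical threshold); otherwise it is
    negative and becomes a candidate for pseudo-labeling.
    """
    positives, negatives = [], []
    for i, box in enumerate(proposals):
        best = max((iou(box, g) for g in gt_boxes), default=0.0)
        (positives if best >= pos_iou else negatives).append(i)
    return positives, negatives

# Toy example: one base-class box, three proposals.
gt = [(0, 0, 10, 10)]
props = [(0, 0, 10, 10), (20, 20, 30, 30), (1, 1, 10, 10)]
pos, neg = split_proposals(props, gt)
```

Only the indices in `neg` would be passed on to a zero-shot classifier; positives keep their base-class supervision.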