Taming Self-Training for Open-Vocabulary Object Detection

Shiyu Zhao; Samuel Schulter; Long Zhao; Zhixing Zhang; Vijay Kumar B. G; Yumin Suh; Manmohan Chandraker; Dimitris N. Metaxas

TL;DR

This work applies teacher-student self-training to open-vocabulary object detection (OVD) and addresses two challenges: noisy pseudo labels and frequently shifting pseudo-label distributions. It introduces SAS-Det, which combines a split-and-fusion (SAF) head with a periodic teacher-update strategy. The SAF head splits the detection head into a closed-branch trained only on base-category ground truth and an open-branch trained on base annotations plus pseudo labels. This limits the noisy supervision from pseudo boxes, and fusing the two branches' complementary predictions improves performance. Updating the teacher only periodically reduces how often the pseudo-label distribution changes, which stabilizes training. SAS-Det reports 37.4 AP50 and 29.1 APr on the novel categories of COCO and LVIS, respectively, outperforming recent models of the same scale while computing pseudo labels efficiently on the fly. The pipeline leverages CLIP text embeddings and external region proposals, and avoids the handcrafted offline pseudo-labeling heuristics of prior work.
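To make the fusion step concrete, here is a minimal sketch of how per-category scores from the two branches might be combined. It assumes a weighted geometric mean with separate weights for base and novel categories; the function name `fuse_scores`, the weights `w_base` and `w_novel`, and their default values are hypothetical and are not taken from the paper or its code release.

```python
def fuse_scores(closed_scores, open_scores, is_base, w_base=0.35, w_novel=0.65):
    """Fuse per-category scores from the closed- and open-branch.

    Assumption (not from the source): a weighted geometric mean, where
    the weight is the share given to the open-branch score.
    w_base / w_novel are hypothetical values chosen for illustration.
    """
    fused = []
    for p_closed, p_open, base in zip(closed_scores, open_scores, is_base):
        # Trust the closed-branch more on base categories (it only sees
        # clean ground truth) and the open-branch more on novel ones.
        w = w_base if base else w_novel
        fused.append((p_closed ** (1.0 - w)) * (p_open ** w))
    return fused


# Toy scores for one region over three categories: two base, one novel.
closed = [0.64, 0.10, 0.20]
opened = [0.36, 0.10, 0.80]
fused = fuse_scores(closed, opened, is_base=[True, True, False])
```

With a geometric mean, each fused score lies between the two branch scores, so neither branch can push a category's score beyond what the other supports.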

Abstract

Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det, which tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open-branch and a closed-branch. This design reduces noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy that decreases the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions and stabilizing the training process. Extensive experiments demonstrate that SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. Code is available at https://github.com/xiaofeng94/SAS-Det.
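The periodic update strategy can be illustrated with a small, framework-free sketch. It assumes the teacher is replaced by a copy of the student every `update_period` iterations and stays frozen in between, so the pseudo-label distribution only changes at those points. The names `train_with_periodic_teacher`, `student_step`, and `toy_step`, and the use of a plain dict for model weights, are hypothetical and stand in for a real detector and optimizer.

```python
import copy


def train_with_periodic_teacher(student, num_iters, update_period, student_step):
    """Toy self-training loop with a periodically updated teacher.

    Assumption (not from the source): an update is a plain copy of the
    student's weights into the teacher every `update_period` iterations.
    """
    # Teacher and student start from the same weights.
    teacher = copy.deepcopy(student)
    update_iters = []
    for it in range(1, num_iters + 1):
        # The teacher is frozen here, so the pseudo labels it would
        # produce come from a fixed distribution between updates.
        student = student_step(student, teacher)
        if it % update_period == 0:
            teacher = copy.deepcopy(student)
            update_iters.append(it)
    return student, teacher, update_iters


def toy_step(student, teacher):
    # Stand-in for one optimizer step on ground truth plus pseudo labels.
    return {name: w + 0.1 for name, w in student.items()}


student, teacher, update_iters = train_with_periodic_teacher(
    {"w": 0.0}, num_iters=10, update_period=4, student_step=toy_step
)
```

In this toy run the teacher changes only twice in ten iterations, whereas a scheme that updated the teacher at every step would change the pseudo-label source ten times.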

Paper Structure (27 sections, 2 equations, 8 figures, 14 tables)

This paper contains 27 sections, 2 equations, 8 figures, 14 tables.

Introduction
Related Work
Approach
Adapting CLIP to OVD
Taming Self-Training
Split-and-Fusion (SAF) Head
Experiments
Experiment Setup
Comparison with the Existing Methods
Ablation Studies
Further Analysis
Conclusion
Extra Details for The Main Paper
Our detector for OVD without an external RPN
Improving initial pseudo labels with RPN scores
...and 12 more sections

Figures (8)

Figure 1: Left: Prior PL-based methods for OVD rely on handcrafted heuristics to leverage a frozen VLM for offline pseudo labels. This is usually inefficient and does not allow for improving PLs throughout training. Right: We customize self-training and finetune VLMs for OVD, which allows efficient on-the-fly computation of PLs that can be improved throughout training.
Figure 2: (a) Pipeline of our self-training. The teacher and the student are models of the same architecture. They are initialized with the same pretrained CLIP model. The teacher generates PLs that are used to train the student, and the student updates the teacher periodically. (b) Structure of our detector. The proposed SAF head is put on top of a CLIP image encoder. The open- and closed-branches take the text embeddings from a CLIP text encoder as classifier.
Figure 3: Quality of PLs during training.
Figure 4: Two stage training for an OVD detector without an external RPN. (a) In the first stage, only RPN box head is trained. Text embeddings are used for classification at inference time. (b) In the second stage, no modules are frozen.
Figure 5: Visualizations of failure cases in PLs after three updates. All samples are from COCO. Two major types of failures: (a) Redundant boxes. (b) Wrong categories.
...and 3 more figures
