Adaptive Self-Training for Object Detection

Renaud Vandeghen; Gilles Louppe; Marc Van Droogenbroeck

TL;DR

ASTOD tackles semi-supervised object detection (SSOD) by adapting the pseudo-label score threshold to the data distribution automatically: the threshold is read from the ground value of the confidence-score histogram. It combines multi-view pseudo-label generation with a teacher-student loop and a per-pseudo-label weighting scheme that down-weights uncertain labels, followed by iterative refinement. The method achieves competitive or superior performance on MS-COCO under various labeling percentages and adapts to satellite imagery (DIOR) without manual threshold tuning. This removes the need for dataset-specific threshold sweeps and makes the approach practical for SSOD across diverse domains.

Abstract

Deep learning has emerged as an effective solution for the task of object detection in images, but at the cost of requiring large labeled datasets. To mitigate this cost, semi-supervised object detection methods, which leverage abundant unlabeled data, have been proposed and have already shown impressive results. However, most of these methods require linking a pseudo-label to a ground-truth object by thresholding. In previous works, this threshold value is usually determined empirically, which is time-consuming, and only done for a single data distribution. When the domain, and thus the data distribution, changes, a new and costly parameter search is necessary. In this work, we introduce our method Adaptive Self-Training for Object Detection (ASTOD), which is a simple yet effective teacher-student method. ASTOD determines without cost a threshold value based directly on the ground value of the score histogram. To improve the quality of the teacher predictions, we also propose a novel pseudo-labeling procedure. We use different views of the unlabeled images during the pseudo-labeling step to reduce the number of missed predictions and thus obtain better candidate labels. Our teacher and our student are trained separately, and our method can be used in an iterative fashion by replacing the teacher with the student. On the MS-COCO dataset, our method consistently performs favorably against state-of-the-art methods that do not require a threshold parameter, and shows competitive results with methods that require a parameter sweep search. Additional experiments with respect to a supervised baseline on the DIOR dataset containing satellite images lead to similar conclusions, and prove that it is possible to adapt the score threshold automatically in self-training, regardless of the data distribution. The code is available at https://github.com/rvandeghen/ASTOD
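
To make the histogram-based threshold concrete, the sketch below shows one possible reading of "the ground value of the score histogram": bin the teacher's confidence scores and take the left edge of the least-populated bin as the threshold. This is an illustration under stated assumptions, not the authors' implementation. The function name `histogram_threshold`, the bin count, and the tie-breaking rule (first lowest bin wins) are all hypothetical choices made here.

```python
def histogram_threshold(scores, num_bins=10):
    """Return the left edge of the least-populated score bin.

    Assumption: the "ground value" is taken to be the lowest bin of a
    histogram over confidence scores in [0, 1]. Ties are resolved in
    favor of the lowest-score bin.
    """
    if not scores:
        raise ValueError("scores must not be empty")

    # Count how many scores fall into each equal-width bin.
    counts = [0] * num_bins
    for s in scores:
        idx = int(s * num_bins)
        idx = min(max(idx, 0), num_bins - 1)  # clamp s == 1.0 into the last bin
        counts[idx] += 1

    # Index of the least-populated bin (the "valley" of the histogram).
    lowest = min(range(num_bins), key=lambda i: counts[i])
    return lowest / num_bins


def filter_candidates(scores, tau):
    """Keep only candidate scores at or above the threshold tau."""
    return [s for s in scores if s >= tau]


# Toy U-shaped score distribution: many low and high scores, few in between.
toy_scores = (
    [0.05] * 50 + [0.15] * 30 + [0.25] * 20 + [0.35] * 10 + [0.45] * 6
    + [0.55] * 4 + [0.65] * 3 + [0.75] * 2 + [0.85] * 8 + [0.95] * 40
)
tau = histogram_threshold(toy_scores, num_bins=10)
pseudo_labels = filter_candidates(toy_scores, tau)
```

Because the threshold is derived from the scores themselves, rerunning the same function on a different dataset yields a different value without any parameter sweep. Note that in this simplified version an empty bin would always be selected as the lowest one.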

Paper Structure (11 sections, 4 equations, 5 figures, 7 tables)

Introduction
Related Work
Method
Experiments
Experimental setup
Results
Ablation study
Conclusion
Supplementary Material
Implementation details
Student training

Figures (5)

Figure 1: Pipeline of our self-training ASTOD method. (1) A teacher is trained with the labeled dataset. (2) We use the teacher to generate candidate labels on the unlabeled data using multiple views. We apply the inverse view transformation to gather the different predictions in the same dimensional space. The predictions are then merged with NMS. (3) Based on the confidence score histogram, we determine the threshold value $\tau$ to filter the candidate boxes, leading to a pseudo-labeled dataset. (4) Next, we train the student with the labeled and pseudo-labeled datasets. (5) Finally, we refine the student with the labeled dataset and use it to replace the teacher. ASTOD can then be used in an iterative fashion by replacing the teacher (2) with the refined student.
Figure 2: Comparison between the candidate labels for the different views. The normal view (a) misses two snowboards. Both flipped and scaled+flipped views (c) and (d) miss the small snowboard. Only the scaled view (b) has detected all the snowboards. The aggregated view (e) combines the information of all images (with NMS) to produce the final candidate labels. Note that images (b), (c) and (d) are transformed back to the original space.
Figure 3: Histograms for different parameters.
Figure 4: Score histograms for a single class ($\tau=0.7$) (a), and for all the classes ($\tau=0.75$) (b).
Figure 5: Comparison between the different learning curves of student and refined models w.r.t. the batch size configuration. The vertical dashed line indicates when the refinement step begins.
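
The multi-view step described in the captions of Figures 1 and 2 can be sketched as follows: predictions from each view are mapped back to the original image coordinates by inverting the view transformation, then merged with greedy non-maximum suppression (NMS). This is a minimal sketch, not the authors' code. The box layout `(x1, y1, x2, y2, score)`, the function names, the single-class setting, and the IoU threshold of 0.5 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def undo_view(box, width, scale=1.0, flipped=False):
    """Map a box predicted on a transformed view back to the original image."""
    x1, y1, x2, y2, score = box
    # Undo the rescaling first, so coordinates are in original-size pixels.
    x1, y1, x2, y2 = x1 / scale, y1 / scale, x2 / scale, y2 / scale
    if flipped:
        # Undo the horizontal flip using the original image width.
        x1, x2 = width - x2, width - x1
    return (x1, y1, x2, y2, score)


def nms(boxes, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box among overlapping ones."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def merge_views(views, width, iou_threshold=0.5):
    """views: list of (boxes, scale, flipped) tuples, one per view."""
    candidates = []
    for boxes, scale, flipped in views:
        candidates.extend(undo_view(b, width, scale, flipped) for b in boxes)
    return nms(candidates, iou_threshold)


# Toy example on a 100-pixel-wide image with two objects, A and B.
views = [
    ([(10, 10, 30, 30, 0.9)], 1.0, False),                         # normal: sees A only
    ([(70, 10, 90, 30, 0.8), (20, 60, 40, 80, 0.7)], 1.0, True),   # flipped: sees A and B
    ([(120, 120, 160, 160, 0.6)], 2.0, False),                     # scaled x2: sees B only
]
merged = merge_views(views, width=100)
```

In this toy case the normal view misses object B, but the aggregated result contains both objects once, each with the highest score found across views, which mirrors the behavior illustrated in Figure 2.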
