DINOSTAR: Deep Iterative Neural Object Detector Self-Supervised Training for Roadside LiDAR Applications
Muhammad Shahbaz, Shaurya Agarwal
TL;DR
This work tackles the high labeling burden in roadside LiDAR object detection by introducing a self-supervised teacher–student framework. Multiple statistically modeled teachers generate noisy annotations through background filtering, clustering, and heuristic bounding-box fitting, and these annotations train a robust student detector without human labeling. The approach achieves performance competitive with supervised detectors on public roadside datasets and emphasizes data augmentation across locations and perspectives. Its scalable, autonomous annotation pipeline can reduce labeling costs when deploying roadside perception systems at scale.
Abstract
Recent advancements in deep-learning methods for object detection in point-cloud data have enabled numerous roadside applications, fostering improvements in transportation safety and management. However, the intricate nature of point-cloud data poses significant challenges for human-supervised labeling, resulting in substantial expenditures of time and capital. This paper addresses the issue by developing an end-to-end, scalable, and self-supervised framework for training deep object detectors tailored for roadside point-cloud data. The proposed framework leverages self-supervised, statistically modeled teachers to train off-the-shelf deep object detectors, thus circumventing the need for human supervision. The teacher models follow a fine-tuned set of standard practices (background filtering, object clustering, bounding-box fitting, and classification) to generate noisy labels. It is shown that training the student model on the combined noisy annotations from a multitude of teachers enhances its capacity to discern background from foreground and forces it to learn diverse point-cloud representations for the object categories of interest. The evaluations, involving publicly available roadside datasets and state-of-the-art deep object detectors, demonstrate that the proposed framework achieves performance comparable to deep object detectors trained on human-annotated labels, despite not using any human annotations in its training process.
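The four teacher stages named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the voxel-occupancy background model, the breadth-first Euclidean clustering, the axis-aligned box fit, the size-based class rule, and every function name and threshold below are assumptions chosen only to make the pipeline concrete.

```python
from collections import deque

import numpy as np


def voxel_keys(points, voxel=0.5):
    """Map each (x, y, z) point to an integer voxel index."""
    return np.floor(points / voxel).astype(np.int64)


def build_background(frames, voxel=0.5, min_ratio=0.8):
    """Assumed background model: voxels occupied in most frames are static."""
    counts = {}
    for pts in frames:
        for key in {tuple(k) for k in voxel_keys(pts, voxel)}:
            counts[key] = counts.get(key, 0) + 1
    return {k for k, c in counts.items() if c / len(frames) >= min_ratio}


def filter_foreground(points, background, voxel=0.5):
    """Keep only points whose voxel is not part of the background set."""
    keys = voxel_keys(points, voxel)
    mask = np.array([tuple(k) not in background for k in keys], dtype=bool)
    return points[mask]


def cluster(points, radius=1.0, min_points=3):
    """Euclidean clustering: breadth-first growth over neighbors within radius."""
    labels = np.full(len(points), -1)
    next_label = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        queue = deque([seed])
        while queue:
            i = queue.popleft()
            dist = np.linalg.norm(points - points[i], axis=1)
            for j in np.flatnonzero((dist <= radius) & (labels == -1)):
                labels[j] = next_label
                queue.append(j)
        next_label += 1
    groups = [points[labels == l] for l in range(next_label)]
    return [g for g in groups if len(g) >= min_points]


def fit_box(cluster_points):
    """Axis-aligned bounding box, returned as (center, size)."""
    lo, hi = cluster_points.min(axis=0), cluster_points.max(axis=0)
    return (lo + hi) / 2, hi - lo


def classify(size, vehicle_length=2.0):
    """Hypothetical heuristic: a long horizontal footprint means a vehicle."""
    return "vehicle" if max(size[0], size[1]) >= vehicle_length else "pedestrian"


def teacher_labels(frame, background, voxel=0.5):
    """Run one teacher on one frame and return its noisy annotations."""
    annotations = []
    for group in cluster(filter_foreground(frame, background, voxel)):
        center, size = fit_box(group)
        annotations.append({"center": center, "size": size, "label": classify(size)})
    return annotations
```

In this reading, several such teachers with different thresholds (for example, different `voxel`, `radius`, or `min_ratio` values) would each label the same frames, and the union of their noisy annotations would form the training set for the student detector.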
