Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching

Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni, Naresh Manwani

TL;DR

The paper tackles noisy pseudo-labels from vision-language models for real-time autonomous driving by introducing a per-object co-teaching framework that trains two YOLO detectors using each other’s anchor-level losses to filter unreliable boxes. Pseudo-labels are generated offline by OWLv2 and refined during training with a curriculum-based forget-rate, enabling robust learning without extensive human annotation. Across KITTI, ACDC, and BDD100K, the method substantially outperforms a baseline trained on pseudo-labels, with notable gains (e.g., KITTI improving from ~31% to ~47% mAP@0.5, and up to ~77.8% mAP@0.5 when incorporating 25% ground-truth data), while preserving real-time inference via YOLOv5m. The approach scales with unlabeled data and benefits from even small amounts of ground-truth labels, offering a practical pathway to deploy robust, open-vocabulary detectors in autonomous driving without prohibitive labeling costs.

Abstract

Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object co-teaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, each filtering out unreliable boxes from every mini-batch based on its peer's per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost ($31.12\%$ to $46.61\%$) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels ($10\%$) leads to further performance gains, reaching $57.97\%$ mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100K datasets.
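The per-object filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear ramp of the forget rate, and the toy loss values are all assumptions made for illustration, and a real pipeline would take the per-object losses from the two YOLO models rather than from hand-written lists. The idea shown is only that each model ranks the pseudo-labelled boxes in a mini-batch by its own loss and passes the small-loss subset to its peer.

```python
from typing import List, Sequence, Tuple


def forget_rate(epoch: int, max_rate: float = 0.2, warmup_epochs: int = 10) -> float:
    """Hypothetical curriculum: drop nothing at epoch 0, ramp linearly to max_rate."""
    if warmup_epochs <= 0:
        return max_rate
    return max_rate * min(epoch / warmup_epochs, 1.0)


def keep_small_loss(per_object_loss: Sequence[float], rate: float) -> List[int]:
    """Return indices of the boxes kept after dropping the `rate` fraction with the largest loss."""
    n = len(per_object_loss)
    n_keep = n - int(rate * n)
    order = sorted(range(n), key=lambda i: per_object_loss[i])
    return sorted(order[:n_keep])


def co_teaching_select(
    loss_a: Sequence[float], loss_b: Sequence[float], rate: float
) -> Tuple[List[int], List[int]]:
    """Cross-selection: model A chooses the boxes model B trains on, and vice versa."""
    keep_for_b = keep_small_loss(loss_a, rate)  # A's small-loss boxes go to B
    keep_for_a = keep_small_loss(loss_b, rate)  # B's small-loss boxes go to A
    return keep_for_a, keep_for_b


# Toy per-object losses for four pseudo-labelled boxes in one mini-batch.
loss_a = [0.1, 0.2, 5.0, 0.3]  # model A finds box 2 suspicious
loss_b = [0.2, 4.0, 0.1, 0.3]  # model B finds box 1 suspicious
keep_for_a, keep_for_b = co_teaching_select(loss_a, loss_b, rate=0.25)
```

In this toy batch, model A would be updated on the boxes its peer considers clean (`keep_for_a`), and model B on the boxes A considers clean (`keep_for_b`); the remaining boxes are simply excluded from that step's loss, so the rest of the image still contributes to training.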

Paper Structure

This paper contains 33 sections, 5 equations, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Pipeline for training robust, open vocabulary, real-time object detectors
  • Figure 2: Comparison of predictions made by a vanilla-trained YOLO and by a YOLO trained with our method.