CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a Dual Teacher-Student Framework

Yushan Han, Hui Zhang, Honglei Zhang, Jing Wang, Yidong Li

TL;DR

CoDTS tackles sparse supervision in collaborative perception by introducing a dual teacher-student framework that combines Main Foreground Mining and Supplement Foreground Mining with Adaptive Thresholding and Neighbor Anchor Sampling. The two-stage training strategy (warm-up and refinement) enables mutual learning between the student and a dynamic teacher, yielding pseudo labels that are both high in quality and abundant in quantity. Across four large-scale datasets, CoDTS consistently surpasses prior sparsely supervised methods and approaches full-supervision performance, while also outperforming several semi-supervised baselines, demonstrating strong practical impact for cost-effective, robust multi-agent perception.

Abstract

Current collaborative perception methods often rely on fully annotated datasets, which can be expensive to obtain in practical situations. To reduce annotation costs, some works adopt sparsely supervised learning techniques and generate pseudo labels for the missing instances. However, these methods fail to achieve an optimal confidence threshold that harmonizes the quality and quantity of pseudo labels. To address this issue, we propose an end-to-end Collaborative perception Dual Teacher-Student framework (CoDTS), which employs adaptive complementary learning to produce both high-quality and high-quantity pseudo labels. Specifically, the Main Foreground Mining (MFM) module generates high-quality pseudo labels based on the prediction of the static teacher. Subsequently, the Supplement Foreground Mining (SFM) module ensures a balance between the quality and quantity of pseudo labels by adaptively identifying missing instances based on the prediction of the dynamic teacher. Additionally, the Neighbor Anchor Sampling (NAS) module is incorporated to enhance the representation of pseudo labels. To promote adaptive complementary learning, we implement a staged training strategy that trains the student and dynamic teacher in a mutually beneficial manner. Extensive experiments demonstrate that CoDTS effectively ensures an optimal balance of pseudo labels in both quality and quantity, establishing a new state of the art in sparsely supervised collaborative perception. The code is available at https://github.com/CatOneTwo/CoDTS.
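The complementary use of the two teachers described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (the linked repository is the reference for that). The axis-aligned 2D boxes, the `iou` helper, the function name `merge_pseudo_labels`, the fixed threshold values, and the rule that a supplementary box is kept only when it does not overlap an already accepted box are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical 2D box: (x1, y1, x2, y2)
Pred = Tuple[Box, float]                 # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_pseudo_labels(
    static_preds: List[Pred],
    dynamic_preds: List[Pred],
    main_thr: float = 0.6,     # assumed high threshold for the static teacher
    supp_thr: float = 0.3,     # assumed lower threshold for the dynamic teacher
    overlap_thr: float = 0.1,  # assumed IoU above which a box counts as a duplicate
) -> List[Box]:
    # Main mining: keep only confident static-teacher predictions.
    merged = [box for box, score in static_preds if score >= main_thr]
    # Supplementary mining: add dynamic-teacher predictions that cover
    # instances the main set missed (no significant overlap with kept boxes).
    for box, score in dynamic_preds:
        if score < supp_thr:
            continue
        if all(iou(box, kept) <= overlap_thr for kept in merged):
            merged.append(box)
    return merged


static_preds = [((0, 0, 2, 2), 0.9), ((10, 10, 12, 12), 0.4)]
dynamic_preds = [((0, 0, 2, 2), 0.8), ((10, 10, 12, 12), 0.5), ((20, 20, 22, 22), 0.1)]
labels = merge_pseudo_labels(static_preds, dynamic_preds)
```

In this toy input, the second object falls below the main threshold but is recovered from the dynamic teacher, while the low-confidence third prediction is rejected by both. In the paper the supplementary threshold is adapted during training rather than fixed as it is here.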

Paper Structure

This paper contains 34 sections, 7 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: (a)-(b): Labels of fully and sparsely supervised collaborative perception. (c)-(g): The impact of different confidence thresholds on pseudo labels. The false prediction ratio (FPR) = the number of false predictions / the number of pseudo labels, and the missing prediction ratio (MPR) = the number of missed predictions / the number of ground truths.
  • Figure 2: The CoDTS framework employs a staged training strategy. (a) In the warm-up stage, the MFM utilizes a low threshold to generate a sufficient amount of pseudo labels, which are used to guide the student and pre-train the dynamic teacher. (b) In the refinement stage, the MFM raises the threshold to obtain high-quality pseudo labels, while the SFM utilizes dynamic threshold adaptation to adjust the threshold and adaptively identify missing instances to complement the MFM. This Adaptive Complementary Learning ensures the generation of both high-quality and high-quantity pseudo labels, which are then merged to guide the student. Throughout both stages, the NAS is used to enhance the representations of pseudo labels.
  • Figure 3: Visualization of pseudo label generation through adaptive complementary learning. (a)-(c) represent the pseudo labels generated by MFM, SFM, and their merged results, respectively.
  • Figure 4: The process of NAS. First, we generate bounding boxes for positive instances. We then select neighbor instances that have a high overlap with these bounding boxes. Finally, the positive instances are combined with the neighbor instances to create dense positive instances.
  • Figure 5: Ablation study of staged training strategy (STT) on DAIR-V2X.
  • ...and 4 more figures
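
The two ratios defined in the Figure 1 caption can be computed as in the following Python sketch, which also shows how the confidence threshold trades one against the other. The function name `pseudo_label_ratios` and the input format are hypothetical. The sketch assumes each prediction has already been matched against the ground truth and that every correct prediction corresponds to a distinct ground-truth object; the matching rule itself is not specified in this summary.

```python
from typing import List, Tuple


def pseudo_label_ratios(
    preds: List[Tuple[float, bool]],  # (confidence, matches_a_ground_truth)
    num_gt: int,                      # number of ground-truth objects
    threshold: float,                 # confidence threshold for pseudo labels
) -> Tuple[float, float]:
    """Return (FPR, MPR) for the pseudo labels kept at `threshold`."""
    kept = [is_true for score, is_true in preds if score >= threshold]
    num_pseudo = len(kept)
    num_true = sum(kept)
    num_false = num_pseudo - num_true
    # Assumes each true prediction covers a distinct ground-truth object.
    num_missed = num_gt - num_true
    fpr = num_false / num_pseudo if num_pseudo else 0.0
    mpr = num_missed / num_gt if num_gt else 0.0
    return fpr, mpr


preds = [(0.9, True), (0.7, True), (0.5, False), (0.3, True), (0.2, False)]
low = pseudo_label_ratios(preds, num_gt=4, threshold=0.25)   # (0.25, 0.25)
high = pseudo_label_ratios(preds, num_gt=4, threshold=0.6)   # (0.0, 0.5)
```

With the toy data, raising the threshold removes the false prediction (FPR drops from 0.25 to 0.0) but discards a correct one (MPR rises from 0.25 to 0.5), which is the quality-versus-quantity tension the figure illustrates.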