PCT: Perspective Cue Training Framework for Multi-Camera BEV Segmentation

Haruya Ishikawa, Takumi Iida, Yoshinori Konishi, Yoshimitsu Aoki

TL;DR

This paper tackles limited BEV annotations and domain shift in multi-camera BEV segmentation by introducing Perspective Cue Training (PCT), which exploits unlabeled perspective-view (PV) images via a PV task head trained jointly with the BEV segmentation head. PCT supervises the PV task with pseudo-labels from public semantic segmentation models (e.g., Mask2Former, DeepLabV3+), giving the training-time loss $L_{PCT} = L_{BEV} + \lambda_{PV} L_{PV}$ without increasing inference cost. For SSL, a Mean-Teacher framework with weak/strong augmentations and BEV Feature Dropout (BFD) adds consistency losses, yielding $L_{SSL} = L_{BEV} + \lambda_{PV} L_{PV} + \lambda_{Strong} L_{Strong} + \lambda_{BFD} L_{BFD}$; for UDA, CamDrop perturbs the camera inputs to encourage domain-invariant representations, and training uses $L_{UDA} = L_{BEV} + \lambda_{PV} L_{PV}$. Empirical results on nuScenes and Argoverse2 show that PCT, especially when combined with CamDrop and BFD, outperforms strong baselines and existing UDA methods, improving both SSL and UDA for camera-based BEV segmentation and reducing annotation cost in real-world deployments.
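
The loss formulas in the TL;DR are plain weighted sums, which can be illustrated with a minimal sketch. The names `LossWeights`, `pct_loss`, and `ssl_loss` are hypothetical, and the default weights of 1.0 are placeholders, since this summary does not state the values of $\lambda_{PV}$, $\lambda_{Strong}$, or $\lambda_{BFD}$.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder values: the summary does not give the actual weights.
    pv: float = 1.0      # lambda_PV
    strong: float = 1.0  # lambda_Strong
    bfd: float = 1.0     # lambda_BFD


def pct_loss(l_bev: float, l_pv: float, w: LossWeights) -> float:
    """L_PCT = L_BEV + lambda_PV * L_PV (the same form is listed for L_UDA)."""
    return l_bev + w.pv * l_pv


def ssl_loss(l_bev: float, l_pv: float, l_strong: float, l_bfd: float,
             w: LossWeights) -> float:
    """L_SSL = L_PCT + lambda_Strong * L_Strong + lambda_BFD * L_BFD."""
    return pct_loss(l_bev, l_pv, w) + w.strong * l_strong + w.bfd * l_bfd
```

In a real training loop the individual terms would be tensors produced by the BEV head, the PV head, and the consistency branches; scalars are used here only to keep the sketch self-contained.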

Abstract

Generating annotations for bird's-eye-view (BEV) segmentation is challenging because of scene complexity and the high cost of manual annotation. In this work, we address these challenges by leveraging the abundance of available unlabeled data. We propose the Perspective Cue Training (PCT) framework, a novel training framework that uses pseudo-labels generated from unlabeled perspective images by publicly available semantic segmentation models trained on large street-view datasets. PCT attaches a perspective-view task head to the image encoder shared with the BEV segmentation head, so that the unlabeled data can be used for training through the generated pseudo-labels. Since image encoders are present in nearly all camera-based BEV segmentation architectures, PCT is flexible and applicable to various existing BEV architectures. PCT can be applied in various settings where unlabeled data is available; in this paper, we apply it to semi-supervised learning (SSL) and unsupervised domain adaptation (UDA). Additionally, we introduce strong input perturbation through Camera Dropout (CamDrop) and feature perturbation via BEV Feature Dropout (BFD), which are crucial for enhancing SSL capabilities in our teacher-student framework. Our overall approach is simple and flexible, yet yields significant improvements over various baselines for SSL and UDA, achieving competitive performance even against the current state of the art.
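
The teacher-student framework mentioned in the abstract is identified in the TL;DR as a Mean-Teacher setup. In mean-teacher training generally, the teacher is not updated by gradients; its weights track an exponential moving average (EMA) of the student's weights. A minimal sketch of that update follows, assuming parameters stored as flat lists of floats. The function name `ema_update` and the momentum of 0.99 are illustrative assumptions, not values taken from the paper.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float],
               momentum: float = 0.99) -> List[float]:
    """Return new teacher weights: momentum * teacher + (1 - momentum) * student.

    The momentum value is a hypothetical default; the source gives none.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same number of weights")
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]
```

A momentum close to 1 makes the teacher change slowly, which is what lets it serve as a stable target for the consistency losses.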

Paper Structure (15 sections, 9 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: General overview of our proposed Perspective Cue Training (PCT) framework and its impact on tasks that rely on unlabeled data. The PCT framework uses PV pseudo-labels generated by easily accessible models (e.g., Mask2Former for semantic segmentation) to train multi-camera BEV segmentation models. PCT is flexible and applicable to various BEV architectures. Our method significantly improves on the baseline for both SSL and UDA tasks that use unlabeled data.
  • Figure 2: Visualization of pseudo-labels generated by easily accessible models on the nuScenes dataset. We generate semantic segmentation pseudo-labels from pretrained models, namely DeepLabV3+ [Chen2018DLV3], SegFormer [Xie2021SegFormer], and Mask2Former [Cheng2021Mask2Former]. Among the models trained on Cityscapes [Cordts2016Cityscapes] (rows 2 to 4), Mask2Former produces the cleanest predictions, especially for harder domains such as nighttime. When the models are trained on BDD100k [Yu2020BDD100k], which covers diverse weather and time-of-day scenarios, the pseudo-labels are qualitatively better, with over- and under-segmentation occurring less frequently. We also explore the use of relative depth pseudo-labels obtained from Depth Anything [Yang2024DepthAnything]. Best viewed in color and zoomed in.
  • Figure 3: Visualization of how Camera Dropout (CamDrop) augmentation is applied to the perspective views and the BEV ground truth (GT). The back-viewing camera, one of the six cameras, is dropped, and the areas of the BEV GT that are visible only to the dropped camera are masked out.
  • Figure 4: Our proposed semi-supervised learning (SSL) training framework, which combines the proposed PCT, Camera Dropout (CamDrop) augmentation, and BEV Feature Dropout (BFD). The BEV segmentation model is jointly trained with pseudo-labels obtained from the perspective-view task models using PCT. We use the mean-teacher (MT) framework to exploit the proposed input and feature perturbations by enforcing consistency with the teacher model.
  • Figure 5: Qualitative results for semi-supervised learning (SSL) on the 1/16 split and unsupervised domain adaptation (UDA) on the "Day $\rightarrow$ Night" split. Best viewed in color and zoomed in.
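
The CamDrop behaviour described in the Figure 3 caption (drop one camera, then mask the BEV ground-truth cells that only that camera can see) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `cam_drop`, the ignore index `255`, and the per-cell `visibility` sets are all hypothetical. Real code would derive visibility from camera calibration, and might blank the dropped image rather than remove it.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

IGNORE = 255  # hypothetical ignore index for masked BEV cells


def cam_drop(
    images: Dict[str, object],
    bev_gt: List[List[int]],
    visibility: List[List[Set[str]]],
    drop: Optional[str] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, object], List[List[int]], str]:
    """Drop one camera and mask BEV cells visible only to that camera.

    visibility[r][c] is the set of camera names that observe BEV cell (r, c).
    """
    rng = rng or random.Random()
    if drop is None:
        drop = rng.choice(sorted(images))  # pick a camera at random
    kept = {name: img for name, img in images.items() if name != drop}
    masked = [
        [IGNORE if visibility[r][c] == {drop} else label
         for c, label in enumerate(row)]
        for r, row in enumerate(bev_gt)
    ]
    return kept, masked, drop
```

Cells seen by the dropped camera and at least one other camera keep their labels, since the remaining views can still support a prediction there.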