PCT: Perspective Cue Training Framework for Multi-Camera BEV Segmentation
Haruya Ishikawa, Takumi Iida, Yoshinori Konishi, Yoshimitsu Aoki
TL;DR
This paper tackles limited BEV annotations and domain shift in multi-camera BEV segmentation by introducing Perspective Cue Training (PCT), which exploits unlabeled perspective-view (PV) images via a PV task head trained jointly with the BEV segmentation head. PCT uses pseudo-labels from public semantic segmentation models (e.g., Mask2Former, DeepLabV3+) to supervise the PV task, giving a training-time loss $L_{PCT} = L_{BEV} + \lambda_{PV} L_{PV}$ without increasing inference cost. For SSL, a Mean-Teacher framework with weak/strong augmentations and BEV Feature Dropout (BFD) adds consistency losses, yielding $L_{SSL} = L_{BEV} + \lambda_{PV} L_{PV} + \lambda_{Strong} L_{Strong} + \lambda_{BFD} L_{BFD}$; for UDA, CamDrop perturbs camera inputs to encourage domain-invariant representations, with $L_{UDA} = L_{BEV} + \lambda_{PV} L_{PV}$. Experiments on nuScenes and Argoverse2 show that PCT, especially when combined with CamDrop and BFD, outperforms strong baselines and existing UDA methods, improving both SSL and UDA for camera-based BEV segmentation and reducing annotation costs in real-world deployments.
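The three objectives above differ only in which weighted terms are added to the BEV loss, which a minimal Python sketch can make explicit. The names `LossWeights`, `pct_loss`, `ssl_loss`, and `uda_loss` are hypothetical, and the default weights of 1.0 are placeholders, since the values of $\lambda_{PV}$, $\lambda_{Strong}$, and $\lambda_{BFD}$ are not given here. The individual loss terms are assumed to be scalars that have already been computed.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Placeholder weights (assumption): the actual lambda values are not stated here.
    pv: float = 1.0      # lambda_PV
    strong: float = 1.0  # lambda_Strong
    bfd: float = 1.0     # lambda_BFD


def pct_loss(l_bev: float, l_pv: float, w: LossWeights) -> float:
    """L_PCT = L_BEV + lambda_PV * L_PV."""
    return l_bev + w.pv * l_pv


def ssl_loss(l_bev: float, l_pv: float, l_strong: float, l_bfd: float,
             w: LossWeights) -> float:
    """L_SSL = L_BEV + lambda_PV * L_PV + lambda_Strong * L_Strong + lambda_BFD * L_BFD."""
    return pct_loss(l_bev, l_pv, w) + w.strong * l_strong + w.bfd * l_bfd


def uda_loss(l_bev: float, l_pv: float, w: LossWeights) -> float:
    """L_UDA = L_BEV + lambda_PV * L_PV (same form as L_PCT)."""
    return pct_loss(l_bev, l_pv, w)


if __name__ == "__main__":
    weights = LossWeights(pv=0.5, strong=0.25, bfd=0.125)
    print(ssl_loss(1.0, 2.0, 3.0, 4.0, weights))
```

Written this way, the SSL objective is the PCT objective plus two consistency terms, and the UDA objective has the same form as the PCT objective.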
Abstract
Generating annotations for bird's-eye-view (BEV) segmentation is challenging because scenes are complex and manual annotation is costly. In this work, we address these challenges by leveraging abundant unlabeled data. We propose Perspective Cue Training (PCT), a novel training framework that uses pseudo-labels generated from unlabeled perspective images by publicly available semantic segmentation models trained on large street-view datasets. PCT attaches a perspective-view task head to the image encoder shared with the BEV segmentation head, so the encoder can be trained on unlabeled data using the generated pseudo-labels. Since image encoders are present in nearly all camera-based BEV segmentation architectures, PCT is flexible and applicable to various existing BEV architectures, and it can be used in any setting where unlabeled data is available. In this paper, we apply PCT to semi-supervised learning (SSL) and unsupervised domain adaptation (UDA). Additionally, we introduce strong input perturbation through Camera Dropout (CamDrop) and feature perturbation via BEV Feature Dropout (BFD), which are crucial for enhancing SSL capabilities in our teacher-student framework. Our approach is simple and flexible, yet yields significant improvements over various baselines for SSL and UDA and achieves performance competitive with the current state of the art.
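The two perturbations named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify how CamDrop or BFD are implemented, so the details here are assumptions: `cam_drop` zeroes out whole camera views at random while always keeping at least one, and `bev_feature_dropout` applies channel-wise inverted dropout to a BEV feature map. Both function names and the nested-list data layout are hypothetical.

```python
import random


def cam_drop(images, drop_prob, rng):
    """Zero out entire camera views at random (assumed form of CamDrop).

    images: list of per-camera images, each a 2D list of floats (H x W).
    At least one camera is always kept so the input is never empty.
    """
    if not images:
        raise ValueError("expected at least one camera image")
    keep = [rng.random() >= drop_prob for _ in images]
    if not any(keep):
        keep[rng.randrange(len(images))] = True
    return [
        [row[:] for row in img] if k else [[0.0] * len(row) for row in img]
        for img, k in zip(images, keep)
    ]


def bev_feature_dropout(features, drop_prob, rng):
    """Channel-wise inverted dropout on a BEV feature map (assumed form of BFD).

    features: list of channels, each a 2D list of floats (H x W).
    Kept channels are rescaled by 1 / (1 - drop_prob).
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    scale = 1.0 / (1.0 - drop_prob)
    out = []
    for channel in features:
        if rng.random() < drop_prob:
            out.append([[0.0] * len(row) for row in channel])
        else:
            out.append([[v * scale for v in row] for row in channel])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    cams = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(6)]  # six toy 2x2 views
    dropped = cam_drop(cams, drop_prob=0.5, rng=rng)
    print(sum(any(v != 0.0 for row in img for v in row) for img in dropped))
```

In a teacher-student setup of the kind described, the student would typically receive the perturbed inputs or features while the teacher sees the unperturbed ones; the consistency losses themselves are outside the scope of this sketch.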
