Semi-Supervised Learning for Visual Bird's Eye View Semantic Segmentation

Junyu Zhu; Lina Liu; Yu Tang; Feng Wen; Wanlong Li; Yong Liu

TL;DR

This paper addresses the high annotation cost of visual BEV semantic segmentation with a semi-supervised framework that exploits unlabeled images through a segmentation consistency loss and a BEV feature consistency loss in a Mean-Teacher setup. It also introduces conjoint rotation, a data augmentation that increases data diversity while preserving the geometric relationship between front-view images and BEV maps. Experiments on nuScenes and Argoverse show improved accuracy over the supervised baseline, with nuScenes results reported across different label ratios. According to the authors, this is the first work to use unlabeled data to improve visual BEV semantic segmentation, which can reduce the amount of annotation needed for autonomous driving systems.

Abstract

Visual bird's eye view (BEV) semantic segmentation helps autonomous vehicles understand the surrounding environment from images alone, including static elements (e.g., roads) and dynamic elements (e.g., vehicles, pedestrians). However, fully supervised methods require costly annotation, which usually depends on HD maps, 3D object bounding boxes, and camera extrinsic matrices, and this cost limits the capability of visual BEV semantic segmentation. In this paper, we present a novel semi-supervised framework for visual BEV semantic segmentation that boosts performance by exploiting unlabeled images during training. We propose a consistency loss that makes full use of unlabeled data by constraining the model on both the semantic prediction and the BEV feature. Furthermore, we propose a novel and effective data augmentation method named conjoint rotation, which augments the dataset while maintaining the geometric relationship between the front-view images and the BEV semantic segmentation. Extensive experiments on the nuScenes and Argoverse datasets show that our semi-supervised framework can effectively improve prediction accuracy. To the best of our knowledge, this is the first work that explores improving visual BEV semantic segmentation performance using unlabeled data. The code is available at https://github.com/Junyu-Z/Semi-BEVseg
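
As a rough illustration of the two ingredients named in the Figure 2 caption (L2 consistency losses on unlabeled data and an EMA-updated teacher), the following minimal sketch uses plain Python lists in place of tensors. The function names, the mean reduction in the L2 loss, the loss weights `w_sc` and `w_fc`, and the way the three losses are summed are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import List


def l2_consistency(a: List[float], b: List[float]) -> float:
    """Mean squared difference between two equally sized flat arrays.

    Assumption: a mean reduction is used; the paper only says "L2 loss".
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ema_update(teacher: List[float], student: List[float], decay: float) -> List[float]:
    """Move teacher weights toward student weights by an exponential moving average."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def total_loss(l_sup: float, l_sc: float, l_fc: float,
               w_sc: float = 1.0, w_fc: float = 1.0) -> float:
    """Hypothetical weighted sum of supervised and consistency losses."""
    return l_sup + w_sc * l_sc + w_fc * l_fc


# Toy values standing in for flattened BEV predictions and features.
y_u = [0.2, 0.8, 0.5]        # segmentation for the unlabeled image
y_u_flip = [0.1, 0.9, 0.5]   # segmentation for its flipped copy
f_u = [1.0, 2.0]             # BEV feature of the unlabeled image
f_u_flip = [1.0, 4.0]        # BEV feature of the flipped copy

l_sc = l2_consistency(y_u, y_u_flip)
l_fc = l2_consistency(f_u, f_u_flip)
loss = total_loss(l_sup=0.3, l_sc=l_sc, l_fc=l_fc)

# After the student step, the teacher tracks the student slowly.
teacher_weights = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
```

In a real training loop the flipped prediction and feature would first be mapped back to a common BEV frame before the two are compared, and the decay would typically be set close to 1 so that the teacher changes slowly.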

Paper Structure (19 sections, 7 equations, 6 figures, 6 tables)

This paper contains 19 sections, 7 equations, 6 figures, 6 tables.

Introduction
Related works
Visual BEV Semantic Segmentation
Semi-Supervised 2D Semantic Segmentation
Data Augmentation
Method
Visual BEV semantic segmentation Model
Supervised Loss
Segmentation Consistency Loss
BEV Feature Consistency Loss
Conjoint Rotation for Data Augmentation
Training Process
Experiments
Datasets
Network Architecture
...and 4 more sections

Figures (6)

Figure 1: Comparison of mIoU (%) on the nuScenes dataset between our semi-supervised framework and the supervised baseline at different label ratios.
Figure 2: Framework overview. The labeled and unlabeled data are first augmented with the proposed conjoint rotation to obtain $I_{L}$, $\hat{Y}$, and $I_{U}$. The Student Net $M_{S}$ and the Teacher Net $M_{T}$ then predict $Y$ and $Y_{U}$, respectively. Meanwhile, $M_{S}$ predicts $Y^{'}_{U}$ from the flipped image $I^{'}_{U}$. Note that the view transformer of $M_{S}$ and $M_{T}$ takes the camera intrinsic matrix $K$ as input, and $K$ also changes when the image is flipped. The feature consistency loss $L_{fc}$ is the L2 loss between the BEV features of $I_{U}$ and $I^{'}_{U}$, and the segmentation consistency loss $L_{sc}$ is the L2 loss between the BEV semantic segmentations $Y_{U}$ and $Y^{'}_{U}$. The supervised loss $L_{sup}$ is computed between $Y$ and $\hat{Y}$. After $M_{S}$ is updated by gradient descent on these losses, $M_{T}$ is updated as an exponential moving average (EMA) of $M_{S}$. With proper hyper-parameters, the Teacher Net can outperform the Student Net after training.
Figure 3: Illustration of conjoint rotation.
Figure 4: Qualitative results with 20% labels. We follow the color scheme in PON and use the visibility mask (black) for visualization.
Figure 5: Different border modes. (a) Original FV image. Augmented FV image with (b) zero border, (c) reflect border, and (d) replicate border.
...and 1 more figure
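
The Figure 2 caption notes that the intrinsic matrix $K$ changes when the image is flipped. One way to see why is the sketch below, which assumes a horizontal flip, integer pixel indices (so column `u` maps to `width - 1 - u`), and a standard 3x3 pinhole intrinsic matrix. Under those assumptions the first row of $K$ is negated and the principal point is mirrored. The helper names `flip_intrinsics` and `project` are hypothetical, and the paper may use a different but equivalent convention (for example, keeping the focal length positive and mirroring the lateral axis instead).

```python
from typing import List, Tuple

Matrix = List[List[float]]


def flip_intrinsics(K: Matrix, width: int) -> Matrix:
    """Intrinsics of a horizontally flipped image (assumed convention).

    A pixel column u in the original image lands at (width - 1) - u in the
    flipped image, so the first row of K becomes [-fx, -skew, width - 1 - cx].
    """
    fx, skew, cx = K[0]
    return [[-fx, -skew, (width - 1) - cx], list(K[1]), list(K[2])]


def project(K: Matrix, point: Tuple[float, float, float]) -> Tuple[float, float]:
    """Project a 3D point in camera coordinates to pixel coordinates."""
    X, Y, Z = point
    u = (K[0][0] * X + K[0][1] * Y + K[0][2] * Z) / Z
    v = (K[1][0] * X + K[1][1] * Y + K[1][2] * Z) / Z
    return u, v


# Toy intrinsics for a 101-pixel-wide image (values are illustrative only).
K = [[100.0, 0.0, 50.0],
     [0.0, 100.0, 40.0],
     [0.0, 0.0, 1.0]]
width = 101
K_flipped = flip_intrinsics(K, width)

point = (1.0, 2.0, 10.0)
u, v = project(K, point)
u_flipped, v_flipped = project(K_flipped, point)
```

With the flipped matrix, the same 3D point projects to the mirrored column of the flipped image and to an unchanged row, which is the property a view transformer needs in order to lift the flipped image into a consistent BEV frame.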
