CycleBEV: Regularizing View Transformation Networks via View Cycle Consistency for Bird's-Eye-View Semantic Segmentation

Jeongbin Hong, Dooseop Choi, Taeg-Hyun An, Kyounghwan An, Kyoung-Wook Min

TL;DR

CycleBEV is a regularization framework that improves existing view transformation (VT) models for bird's-eye-view (BEV) semantic segmentation. During training, an inverse view transformation (IVT) network regularizes the VT network through cycle consistency losses, so that the VT network captures richer semantic and geometric information from the input perspective-view (PV) images.

Abstract

Transforming image features from perspective view (PV) space to bird's-eye-view (BEV) space remains challenging in autonomous driving due to depth ambiguity and occlusion. Although several view transformation (VT) paradigms have been proposed, these challenges remain unresolved. In this paper, we propose a new regularization framework, dubbed CycleBEV, that enhances existing VT models for BEV semantic segmentation. Inspired by cycle consistency, which is widely used in image distribution modeling, we devise an inverse view transformation (IVT) network that maps BEV segmentation maps back to PV segmentation maps, and we use it to regularize VT networks during training through cycle consistency losses, enabling them to capture richer semantic and geometric information from input PV images. To further exploit the capacity of the IVT network, we introduce two novel ideas that extend cycle consistency into geometric and representation spaces. We evaluate CycleBEV on four representative VT models covering three major paradigms using the large-scale nuScenes dataset. Experimental results show consistent improvements, with gains of up to 0.74, 4.86, and 3.74 mIoU for the drivable area, vehicle, and pedestrian classes, respectively, without increasing inference complexity, since the IVT network is used only during training. The implementation code is available at https://github.com/JeongbinHong/CycleBEV.
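To make the training objective described in the abstract more concrete, the following is a minimal NumPy sketch of one plausible form of a cycle-consistency-regularized loss. It is not the authors' implementation: the function names, the toy linear `vt` and `ivt` stand-ins, the softmax cross-entropy losses, and the weight `lam` are all assumptions made for illustration. The abstract only states that the IVT network maps BEV segmentation maps back to PV segmentation maps and is used during training.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_entropy(probs, onehot, eps=1e-8):
    """Mean per-pixel cross-entropy; class axis is 0, pixel axis is 1."""
    return float(-(onehot * np.log(probs + eps)).sum(axis=0).mean())


def cycle_regularized_loss(pv_feats, bev_gt, pv_gt, vt, ivt, lam=0.5):
    """BEV segmentation loss plus a PV -> BEV -> PV cycle term.

    `vt` and `ivt` are hypothetical callables standing in for the view
    transformation and inverse view transformation networks.
    """
    bev_probs = softmax(vt(pv_feats))    # PV features -> BEV segmentation
    pv_probs = softmax(ivt(bev_probs))   # BEV segmentation -> PV segmentation
    seg_loss = cross_entropy(bev_probs, bev_gt)
    cycle_loss = cross_entropy(pv_probs, pv_gt)
    return seg_loss + lam * cycle_loss, seg_loss, cycle_loss


# Toy setup (all sizes and weights are made up): C feature channels,
# K classes, N_PV perspective-view pixels, N_BEV bird's-eye-view cells.
rng = np.random.default_rng(0)
C, K, N_PV, N_BEV = 8, 3, 20, 16
A = rng.normal(size=(K, C))
B = rng.normal(size=(N_PV, N_BEV))
D = rng.normal(size=(N_BEV, N_PV))


def vt(x):
    """Toy linear VT: (C, N_PV) features -> (K, N_BEV) logits."""
    return A @ x @ B / N_PV


def ivt(p):
    """Toy linear IVT: (K, N_BEV) probabilities -> (K, N_PV) logits."""
    return p @ D


pv_feats = rng.normal(size=(C, N_PV))
bev_gt = np.eye(K)[rng.integers(0, K, size=N_BEV)].T  # one-hot, (K, N_BEV)
pv_gt = np.eye(K)[rng.integers(0, K, size=N_PV)].T    # one-hot, (K, N_PV)

total, seg, cyc = cycle_regularized_loss(pv_feats, bev_gt, pv_gt, vt, ivt)
print(f"seg={seg:.4f} cycle={cyc:.4f} total={total:.4f}")
```

In this sketch the cycle term depends on the VT output, so in a real framework with automatic differentiation its gradient would flow back into the VT network. At inference time only `vt` would be called, which is consistent with the abstract's statement that inference complexity does not increase.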

Paper Structure

This paper contains 31 sections, 5 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: Visualization of model architectures: (a) CVTM [Yang_cvpr21], (b) FocusBEV [Zhao_arxiv24], and (c) the proposed approach. Modules outside the green boxes are used only during training. Image feature extraction modules are omitted for simplicity. CVTM and FocusBEV integrate the BEV2PV module into the model, which increases computational cost and network size, whereas the proposed approach employs it only during training. Furthermore, CVTM enforces cycle consistency (CC) in feature space, where semantics are vague, and FocusBEV applies feature-space projections without an explicit CC loss, so its consistency is both semantically vague and unenforced. As a result, their BEV predictions are not directly constrained. In contrast, the proposed approach enforces semantic-level BEV→PV consistency using a training-only IVT network that directly regularizes BEV predictions.
  • Figure 2: Illustration of the proposed dual-branch IVT network architecture.
  • Figure 3: Visualization of the proposed regularization framework.
  • Figure 4: Prediction results. (a) Input images and their corresponding ground-truth BEV maps. (b) BEV map prediction results. In (b), the first row shows the predictions from the four baseline models. The second, third, and fourth rows show the results when CVTM [Yang_cvpr21], FocusBEV [Zhao_arxiv24], and our method are applied to the baseline models, respectively. Drivable area, vehicle, and pedestrian are color-coded in gray, blue, and red, respectively. Please zoom in for better visibility.
  • Figure 5: Prediction examples on a scene with occluded vehicles. (a) Input images (the first column), ground-truth PV segmentation maps (the second column), and PV segmentation maps predicted by the proposed IVT network (the third column). (b) Ground-truth BEV map (the first column), BEV map predicted by BEVFormer (the second column), and BEV map predicted by BEVFormer+Ours (the third column). The green boxes indicate the AV.
  • ...and 5 more figures