Improving Bird's Eye View Semantic Segmentation by Task Decomposition

Tianhao Zhao, Yongcan Chen, Yu Wu, Tianyang Liu, Bo Du, Peilun Xiao, Shi Qiu, Hongda Yang, Guozhen Li, Yi Yang, Yutian Lin

TL;DR

This work tackles monocular BEV semantic segmentation by decomposing the traditional end-to-end pipeline into two focused stages: learning a robust BEV prior via a polar-coordinate BEV autoencoder and aligning RGB features to the BEV latent space through a column-wise transformer. The BEV maps are transformed from Cartesian to polar coordinates to enable column-wise correspondence with perspective views, while the autoencoder is trained with corrupted latent representations to enforce learning of realistic BEV patterns. The second stage maps RGB images into the BEV latent space and decodes with a frozen BEV decoder, with a subsequent fine-tuning step to better align distributions between RGB inputs and BEV predictions. Experiments on nuScenes and Argoverse demonstrate superior accuracy and efficiency compared to end-to-end and depth-based baselines, validating the approach's effectiveness in challenging cross-view scenarios.

Abstract

Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, the RGB inputs and the BEV targets come from distinct perspectives, which makes direct point-to-point prediction hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct BEV segmentation maps from corrupted, noisy latent representations, which pushes the decoder to learn fundamental knowledge of typical BEV patterns. The second stage maps RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach separates perception and generation into distinct steps, reducing the complexity of learning both at once and equipping the model to handle intricate and challenging scenes effectively. In addition, we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish a column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation, which reduces computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.
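
As a rough illustration of the Cartesian-to-polar transformation described in the abstract, the sketch below resamples a BEV grid so that each output column corresponds to one viewing ray from the camera. The function name `cartesian_to_polar`, the camera placement at the centre of the bottom edge, the 90-degree field of view, and the nearest-neighbour sampling are all assumptions made for this example; none of them is taken from the paper or its released code.

```python
import math


def cartesian_to_polar(bev, n_rays, n_bins, fov_deg=90.0, fill=0):
    """Resample a Cartesian BEV grid into a polar grid (nearest neighbour).

    bev: list of rows of class ids; row 0 is farthest from the camera and
         the camera sits at the centre of the bottom edge (sketch assumption).
    Returns n_bins rows (radius, far to near) by n_rays columns
    (angle, left to right).
    """
    height, width = len(bev), len(bev[0])
    cam_x = width / 2.0    # camera x position, in grid cells
    cam_y = float(height)  # camera y position: the bottom edge of the grid
    max_r = float(height)  # farthest range sampled, in grid cells
    half_fov = math.radians(fov_deg) / 2.0

    polar = [[fill] * n_rays for _ in range(n_bins)]
    for j in range(n_rays):
        # Angle of the ray centre relative to the forward axis; negative = left.
        theta = -half_fov + (j + 0.5) * (2.0 * half_fov / n_rays)
        for i in range(n_bins):
            # Polar row 0 holds the farthest radius bin.
            r = (n_bins - i - 0.5) * (max_r / n_bins)
            x = cam_x + r * math.sin(theta)
            y = cam_y - r * math.cos(theta)
            col, row = math.floor(x), math.floor(y)
            # Samples that fall outside the Cartesian grid keep the fill value.
            if 0 <= row < height and 0 <= col < width:
                polar[i][j] = bev[row][col]
    return polar
```

For a 4x4 grid whose left half is class 1 and right half is class 2, `cartesian_to_polar(bev, n_rays=4, n_bins=4)` returns four rows of `[1, 1, 2, 2]`: the two left rays see only class 1 and the two right rays see only class 2. Each polar column is then a natural counterpart to an image column, which is the correspondence the abstract refers to.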

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Qualitative comparison between a traditional end-to-end method and our method. (a) shows the perspective-view RGB image; (b) and (c) show the BEV segmentation maps predicted without and with task decomposition, respectively.
  • Figure 2: Illustration of the transformation between the Cartesian (left) and polar (right) coordinate systems for BEV segmentation maps. This transformation yields a column-wise correspondence between the BEV segmentation maps and the RGB images.
  • Figure 3: Overview of our two-stage method. (a) In the first stage, a BEV autoencoder is trained on BEV segmentation maps, independently of RGB images. (b) In the second stage, the BEV autoencoder is frozen and RGB-BEV alignment maps the RGB images to BEV latent representations for decoding.
  • Figure 4: Qualitative results on nuScenes (Caesar et al., 2020). We compare with other methods following the color scheme used in PON (Roddick and Cipolla, 2020).
  • Figure 5: IoU trend over distance (0-50 m) on nuScenes (Caesar et al., 2020), compared with PON (Roddick and Cipolla, 2020).
  • ...and 1 more figures
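
To make the IoU-over-distance comparison of Figure 5 concrete, the sketch below groups BEV cells into distance bands and computes IoU separately in each band. It is only an illustration: the function name `iou_by_distance`, the 10 m band width, the camera placement at the centre of the bottom edge, and the use of Euclidean distance to each cell centre are assumptions of this example, and the paper's own evaluation protocol may differ.

```python
import math


def iou_by_distance(pred, target, cell_m, band_m=10.0, max_m=50.0):
    """Per-band IoU for one class on a Cartesian BEV grid.

    pred, target: lists of rows of 0/1 values; row 0 is farthest from the
                  camera, which sits at the centre of the bottom edge
                  (sketch assumption).
    cell_m: side length of one grid cell in metres.
    Returns one IoU per band, or None for a band with an empty union.
    """
    height, width = len(pred), len(pred[0])
    n_bands = int(math.ceil(max_m / band_m))
    inter = [0] * n_bands
    union = [0] * n_bands
    for row in range(height):
        for col in range(width):
            # Offset of the cell centre from the camera, in metres.
            dx = (col + 0.5 - width / 2.0) * cell_m
            dy = (height - row - 0.5) * cell_m
            dist = math.hypot(dx, dy)
            if dist >= max_m:
                continue  # beyond the evaluated range
            band = int(dist // band_m)
            p, t = bool(pred[row][col]), bool(target[row][col])
            inter[band] += int(p and t)
            union[band] += int(p or t)
    return [inter[b] / union[b] if union[b] else None for b in range(n_bands)]
```

With 10 m cells on a 2x2 grid, the near row falls in the 0-10 m band and the far row in the 10-20 m band, so a prediction that is perfect in the far row but has one false positive in the near row scores 1.0 in the second band and 0.5 in the first.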