Improving Bird's Eye View Semantic Segmentation by Task Decomposition
Tianhao Zhao, Yongcan Chen, Yu Wu, Tianyang Liu, Bo Du, Peilun Xiao, Shi Qiu, Hongda Yang, Guozhen Li, Yi Yang, Yutian Lin
TL;DR
This work tackles monocular BEV semantic segmentation by decomposing the traditional end-to-end pipeline into two focused stages: learning a robust BEV prior via a polar-coordinate BEV autoencoder and aligning RGB features to the BEV latent space through a column-wise transformer. The BEV maps are transformed from Cartesian to polar coordinates to enable column-wise correspondence with perspective views, while the autoencoder is trained with corrupted latent representations to enforce learning of realistic BEV patterns. The second stage maps RGB images into the BEV latent space and decodes with a frozen BEV decoder, with a subsequent fine-tuning step to better align distributions between RGB inputs and BEV predictions. Experiments on nuScenes and Argoverse demonstrate superior accuracy and efficiency compared to end-to-end and depth-based baselines, validating the approach's effectiveness in challenging cross-view scenarios.
Abstract
Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline, directly predicting the BEV segmentation map from monocular RGB inputs. However, the RGB inputs and BEV targets come from distinct perspectives, which makes direct point-to-point prediction hard to optimize. In this paper, we decompose the original BEV segmentation task into two stages, namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage, we train a BEV autoencoder to reconstruct BEV segmentation maps from corrupted, noisy latent representations, which pushes the decoder to learn fundamental knowledge of typical BEV patterns. In the second stage, we map RGB input images into the BEV latent space of the first stage, directly optimizing the correlations between the two views at the feature level. Our approach separates perception and generation into distinct steps, reducing the complexity of tackling them jointly and equipping the model to handle intricate and challenging scenes effectively. In addition, we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish a column-wise correspondence between RGB images and BEV maps. Moreover, our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation, which saves computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.
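
To make the Cartesian-to-polar idea concrete, the following is a minimal sketch of resampling a BEV label map onto a (radius, angle) grid. It is not taken from the released code: the function name `bev_cartesian_to_polar`, the camera placement at the bottom-center of the map, the 90-degree field of view, and nearest-neighbor sampling are all assumptions made for illustration.

```python
import numpy as np


def bev_cartesian_to_polar(bev, num_radii, num_angles, fov_deg=90.0, fill=0):
    """Resample a Cartesian BEV label map onto a polar (radius x angle) grid.

    Assumptions (illustrative only, not from the paper): the camera sits at
    the bottom-center pixel of `bev`, looks toward row 0, and sees a
    horizontal field of view of `fov_deg` degrees.
    """
    h, w = bev.shape
    cx, cy = (w - 1) / 2.0, float(h - 1)  # assumed camera position (pixels)

    # Radial bins from the camera out to the far edge of the map.
    radii = np.linspace(0.0, float(h - 1), num_radii)
    # Angular bins spanning the assumed field of view, 0 = straight ahead.
    half_fov = np.deg2rad(fov_deg) / 2.0
    angles = np.linspace(-half_fov, half_fov, num_angles)

    rr, aa = np.meshgrid(radii, angles, indexing="ij")  # (num_radii, num_angles)

    # Nearest-neighbor lookup keeps class labels discrete.
    x = np.rint(cx + rr * np.sin(aa)).astype(int)
    y = np.rint(cy - rr * np.cos(aa)).astype(int)

    # Rays that leave the map are filled with a placeholder label.
    valid = (x >= 0) & (x < w) & (y >= 0) & (y < h)
    polar = np.full((num_radii, num_angles), fill, dtype=bev.dtype)
    polar[valid] = bev[y[valid], x[valid]]
    return polar


# Toy example: a 5x5 map whose center column is labeled 1 (e.g. "road").
bev = np.zeros((5, 5), dtype=np.uint8)
bev[:, 2] = 1
polar = bev_cartesian_to_polar(bev, num_radii=5, num_angles=3, fill=255)
```

Under these assumptions, each column of `polar` holds the labels along one viewing ray, ordered by distance from the camera, which is the kind of column-wise correspondence with the image that the abstract describes.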
