VQ-Map: Bird's-Eye-View Map Layout Estimation in Tokenized Discrete Space via Vector Quantization

Yiwei Zhang, Jin Gao, Fudong Ge, Guan Luo, Bing Li, Zhaoxiang Zhang, Haibin Ling, Weiming Hu

TL;DR

This paper proposes using a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge of high-level BEV semantics in a tokenized discrete space, and generating high-quality BEV maps with the BEV codebook embedding serving as a bridge between PV and BEV.

Abstract

Bird's-eye-view (BEV) map layout estimation requires an accurate and full understanding of the semantics of the environmental elements around the ego car to make the results coherent and realistic. Due to the challenges posed by occlusion, unfavourable imaging conditions and low resolution, generating the BEV semantic maps corresponding to corrupted or invalid areas in the perspective view (PV) has recently become appealing. The question is how to align the PV features with the generative models to facilitate the map estimation. In this paper, we propose to utilize a generative model similar to the Vector Quantized-Variational AutoEncoder (VQ-VAE) to acquire prior knowledge of the high-level BEV semantics in the tokenized discrete space. Thanks to the obtained BEV tokens, accompanied by a codebook embedding that encapsulates the semantics of different BEV elements in the ground-truth maps, we are able to directly align the sparse backbone image features with the BEV tokens obtained from discrete representation learning, using a specialized token decoder module, and finally generate high-quality BEV maps with the BEV codebook embedding serving as a bridge between PV and BEV. We evaluate the BEV map layout estimation performance of our model, termed VQ-Map, on both the nuScenes and Argoverse benchmarks, achieving 62.2/47.6 mean IoU for surround-view/monocular evaluation on nuScenes, as well as 73.4 IoU for monocular evaluation on Argoverse, all of which set a new record for this map layout estimation task. The code and models are available at https://github.com/Z1zyw/VQ-Map.
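
The tokenization step that a VQ-VAE-style model relies on can be illustrated with a nearest-neighbour codebook lookup. The sketch below is a generic NumPy illustration, not the authors' implementation: the function name `quantize`, the array shapes, and the toy codebook are all assumptions made for the example.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (N, D) array of encoder outputs (hypothetical shape).
    codebook: (K, D) array of codebook embeddings (hypothetical shape).
    Returns (tokens, quantized): (N,) integer indices and the
    (N, D) embeddings looked up from the codebook.
    """
    # Squared Euclidean distance between every feature and every entry.
    diff = features[:, None, :] - codebook[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)   # shape (N, K)
    tokens = np.argmin(dist, axis=1)    # discrete token per feature
    quantized = codebook[tokens]        # embedding per token
    return tokens, quantized

# Toy codebook: entry k is the vector [k, k, k, k].
codebook = np.arange(8, dtype=float)[:, None] * np.ones((1, 4))
# Features that sit slightly off entries 3, 5 and 3.
features = codebook[[3, 5, 3]] + 0.1
tokens, quantized = quantize(features, codebook)
```

Here `tokens` plays the role of the discrete BEV tokens and `quantized` the corresponding codebook embeddings; in an actual VQ-VAE the codebook is learned jointly with the encoder and decoder rather than fixed as above.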

Paper Structure

This paper contains 13 sections, 6 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: We showcase the prediction results in various environmental conditions (day, rainy and night from top to bottom). Our VQ-Map produces more reasonable results, even for areas that are not directly visible, while significantly reducing artifacts. Color scheme is the same as in BEVFusion (Liu et al., 2023).
  • Figure 2: VQ-Map employs a generative model similar to the VQ-VAE framework to encode the BEV ground-truth maps into BEV tokens accompanied by a codebook embedding. After the generative model is trained, the BEV tokens serve as the classification labels to supervise the PV feature learning via a specialized token decoder module. During inference, VQ-Map utilizes the predicted BEV tokens to generate high-quality BEV map layouts based on the off-the-shelf codebook embedding and the BEV map generation decoder.
  • Figure 3: Visualization of the BEV codebook embedding by showing the BEV patch images corresponding to the specific BEV tokens. All BEV patch images in the same column correspond to the same token. The data is randomly sampled from the nuScenes validation dataset. Color scheme is the same as in BEVFusion (Liu et al., 2023).
  • Figure 4: Architecture of Our Token Decoder. Pos refers to the positional embedding, and $M$ indicates the layer number.
  • Figure A1: More visualization results for surround-view BEV map layout estimation on nuScenes.
  • ...and 3 more figures
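
The inference path described in the Figure 2 caption (predicted BEV tokens, then a codebook lookup, then a map generation decoder) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: `tokens_to_embedding_grid`, the logits layout, and the grid size are hypothetical, and the map generation decoder itself is omitted.

```python
import numpy as np

def tokens_to_embedding_grid(token_logits, codebook, grid_hw):
    """Turn per-cell token logits into a grid of codebook embeddings.

    token_logits: (H*W, K) classification scores over K tokens
                  (hypothetical output of a token decoder).
    codebook:     (K, D) codebook embeddings.
    grid_hw:      (H, W) size of the BEV token grid.
    Returns the (H, W) token grid and the (H, W, D) embedding grid
    that a map generation decoder would consume.
    """
    h, w = grid_hw
    # Pick the most likely token for each BEV cell.
    tokens = np.argmax(token_logits, axis=-1)
    # Look up each token's embedding and restore the spatial layout.
    embeddings = codebook[tokens].reshape(h, w, -1)
    return tokens.reshape(h, w), embeddings

# Toy example: a 2x2 token grid, 3 tokens, 2-dimensional embeddings.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
token_logits = np.array([
    [0.1, 2.0, 0.3],
    [1.5, 0.2, 0.1],
    [0.0, 0.1, 3.0],
    [0.2, 0.9, 0.4],
])
token_grid, embedding_grid = tokens_to_embedding_grid(
    token_logits, codebook, (2, 2)
)
```

Framing the PV-to-BEV step as per-cell classification over a fixed codebook is what lets the ground-truth tokens act as labels during training, as the caption states; the argmax above is simply the inference-time counterpart of that classification.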