Addressing Diverging Training Costs using BEVRestore for High-resolution Bird's Eye View Map Construction

Minsu Kim, Giseop Kim, Sunwook Choi

TL;DR

The paper addresses the diverging training costs encountered when learning high-resolution (HR) BEV maps for urban scene understanding by introducing BEVRestore, a plug-and-play module that learns representations in a memory-efficient low-resolution (LR) BEV space and restores them to high resolution with a learnable up-sampling operator. BEVRestore fuses LR BEV features from multiple sensors, then up-samples them via a restoration network composed of a learnable module $f_\varphi$ and Pixel Shuffle $\mathcal{PS}$, preserving the BEV scope while increasing resolution. Extensive experiments on nuScenes demonstrate significant improvements in BEV segmentation for camera, LiDAR, and LiDAR-camera fusion (mIoU gains of $4.3\%$, $10.1\%$, and $8.2\%$, respectively) and favorable results for HD map construction, with BEVRestore providing memory efficiency and controlled latency. The work establishes BEVRestore as a flexible, compatible approach that enables accurate HR BEV map construction with reduced computational bottlenecks, advancing safe autonomous driving perception.
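The restoration step described above, a learnable module $f_\varphi$ followed by Pixel Shuffle $\mathcal{PS}$, can be illustrated with a minimal pure-Python sketch. The function names (`pixel_shuffle`, `restore`, `replicate_channels`) are hypothetical, the channel ordering follows the common sub-pixel convolution convention rather than anything stated in this summary, and `replicate_channels` is only a fixed stand-in for $f_\varphi$; in the paper $f_\varphi$ is learned.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list.

    Assumed convention: out[c][h*r + i][w*r + j] = x[c*r*r + i*r + j][h][w].
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # row offset inside each r x r block
            for j in range(r):      # column offset inside each r x r block
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for col in range(w):
                        out[c][y * r + i][col * r + j] = src[y][col]
    return out


def restore(z_lr, f_phi, r):
    """Sketch of S(z) = PS(f_phi(z)): expand channels in LR space, then shuffle."""
    return pixel_shuffle(f_phi(z_lr), r)


def replicate_channels(z_lr, r):
    """Hypothetical placeholder for f_phi: copy every channel r*r times.

    With this fixed choice, restore() reduces to nearest-neighbour up-sampling;
    a trained f_phi would predict distinct sub-pixel values instead.
    """
    return [[row[:] for row in ch] for ch in z_lr for _ in range(r * r)]


if __name__ == "__main__":
    z_lr = [[[1, 2], [3, 4]]]  # one channel, 2 x 2 LR BEV grid
    z_hr = restore(z_lr, lambda t: replicate_channels(t, 2), 2)
    for row in z_hr[0]:
        print(row)
```

The point of the sketch is that all learnable computation happens on the small LR grid; the shuffle itself only rearranges values, so the BEV scope is unchanged while the grid becomes `r` times finer along each axis.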

Abstract

Recent advancements in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architectures incur substantial backpropagation memory and computing latency. This poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, as their large features cause significant increases in costs, including GPU memory consumption and computing latency; we call this the diverging training costs issue. Affected by the problem, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components such as road lanes and sidewalks. Because this imprecision makes motion planning tasks such as collision avoidance risky, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel BEVRestore mechanism. Specifically, our proposed model encodes the features of each sensor into LR BEV space and restores them to HR space to establish a memory-efficient map constructor. To this end, we introduce the BEV restoration strategy, which restores the aliasing and blocky artifacts of the up-scaled BEV features and narrows down the width of the labels. Our extensive experiments show that the proposed mechanism provides a plug-and-play, memory-efficient pipeline, enabling HR map construction with a broad BEV scope.
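A back-of-the-envelope view of why costs diverge with resolution, under two assumptions that are not taken from the paper: the BEV scope is a square of side $L$ metres, and the memory of a BEV-space feature map is proportional to its number of cells. The symbols $L$, $s$, $N$, and $r$ are introduced here only for illustration.

$$N(s) = \left(\frac{L}{s}\right)^{2}, \qquad \frac{N(s_{HR})}{N(s_{LR})} = \left(\frac{s_{LR}}{s_{HR}}\right)^{2} = r^{2}$$

Here $s$ is the resolution in m/px and $r = s_{LR}/s_{HR}$ is the up-sampling factor. For example, with $L = 100$ m, moving from $s_{LR} = 2.0$ m/px to $s_{HR} = 0.5$ m/px raises the cell count from $50^2 = 2500$ to $200^2 = 40000$, a factor of $r^2 = 16$. Actual training costs depend on the full architecture, so this ratio is only a rough guide.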

Paper Structure (15 sections, 6 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: "m/px" means meters per pixel. Training with HR BEV features consumes a large amount of memory due to the bulky architecture. Our plug-and-play BEVRestore (BEVRestore panel) addresses the issue, allowing for cost-efficient map construction (cost panel) and enhancing feature encoding (performance panel).
  • Figure 2: Overview of our proposed method. Our mechanism takes in LiDAR ($\mathbf{z}^{LR}_p$) and camera ($\mathbf{z}^{LR}_i$) BEV features. After the cross-modal fusion neck ($\mathbf{B}_\psi$) fuses and enhances them, our BEVRestore ($S$) up-samples and restores a unified HR BEV feature. Decoding CNNs ($\mathbf{D}_\phi$) then estimate an HR semantic map.
  • Figure 3: Comparison of increasing costs against BEVFusion.
  • Figure 4: Comparison of BEVRestore restoration with conventional methods. We use a (-50m, 50m) BEV scope with 2.0m/px LR BEV and 0.5m/px HR BEV resolutions. The hand-crafted Nearest, Bilinear, and Bicubic methods leave aliasing and blocky artifacts incompletely restored, whereas BEVRestore uses learnable restoration.
  • Figure 5: Restoration comparisons for varying up-sampling factors. We use a (-50m, 50m) BEV scope with 0.5m/px, 1.0m/px, 2.0m/px, and 4.0m/px LR BEV resolutions for the $\times 1$, $\times 2$, $\times 4$, and $\times 8$ models, respectively. All models use a 0.5m/px HR BEV resolution. The $\times 1$ and $\times 2$ BEVRestore models fail to learn high-level scene features (black ROIs). In contrast, the $\times 8$ BEVRestore suffers from inaccurate predictions (green ROIs) caused by severe loss of prior, compared to the $\times 4$ BEVRestore.
  • ...and 2 more figures
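
The resolution settings quoted in the Figure 5 caption can be checked with a short calculation. This is a sketch that assumes the (-50m, 50m) scope applies to both BEV axes, giving a square grid; `grid_side` and `figure5_grids` are hypothetical helper names, not part of any released code.

```python
def grid_side(scope_min_m, scope_max_m, res_m_per_px):
    """Number of BEV cells along one axis for a given scope and resolution."""
    return round((scope_max_m - scope_min_m) / res_m_per_px)


def figure5_grids():
    """Return (factor, LR side, HR side) for the settings in the Figure 5 caption."""
    settings = [(1, 0.5), (2, 1.0), (4, 2.0), (8, 4.0)]  # (up-sampling factor, LR m/px)
    rows = []
    for factor, lr_res in settings:
        lr_side = grid_side(-50, 50, lr_res)
        rows.append((factor, lr_side, lr_side * factor))
    return rows


if __name__ == "__main__":
    for factor, lr_side, hr_side in figure5_grids():
        print(f"x{factor}: LR grid {lr_side} x {lr_side} -> HR grid {hr_side} x {hr_side}")
```

Under that assumption, every setting restores to the same HR grid side that 0.5m/px implies directly, so the up-sampling factor only changes how small the LR grid is.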