Addressing Diverging Training Costs using BEVRestore for High-resolution Bird's Eye View Map Construction
Minsu Kim, Giseop Kim, Sunwook Choi
TL;DR
The paper addresses the diverging training costs encountered when learning high-resolution (HR) BEV maps for urban scene understanding. It introduces BEVRestore, a plug-and-play module that learns representations in a memory-efficient low-resolution (LR) BEV space and restores them to high resolution with a learnable up-sampling operator. BEVRestore fuses LR BEV features from multiple sensors, then up-samples them with a restoration network composed of a learnable module $f_\varphi$ and Pixel Shuffle $\mathcal{PS}$, which preserves the BEV scope while increasing resolution. Extensive experiments on nuScenes show significant improvements in BEV segmentation for camera, LiDAR, and LiDAR-camera fusion (mIoU gains of $4.3\%$, $10.1\%$, and $8.2\%$, respectively) and favorable results for HD map construction, with memory efficiency and controlled latency. The work establishes BEVRestore as a flexible, compatible approach that enables accurate HR BEV map construction with reduced computational bottlenecks, advancing safe autonomous driving perception.
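The up-sampling step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the learnable module $f_\varphi$ is replaced here by a hypothetical 1x1 channel-expanding projection with random weights, and the names `pixel_shuffle`, `restore`, `lr_bev`, and `weight` are assumptions made for illustration. Only the Pixel Shuffle rearrangement itself follows the standard definition, which moves $C r^2$ channels at size $H \times W$ into $C$ channels at size $rH \times rW$.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    # Split the channel axis into (C, r, r), then interleave the two
    # r-axes with the spatial axes: out[c, h*r+i, w*r+j] = in[c*r*r+i*r+j, h, w]
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c, h * r, w * r)


def restore(lr_bev, weight, r):
    """Hypothetical stand-in for f_phi followed by Pixel Shuffle.

    `weight` acts as a 1x1 convolution mapping C channels to C*r*r channels;
    the real f_phi is a learned network and is not specified here.
    """
    expanded = np.einsum("oc,chw->ohw", weight, lr_bev)
    return pixel_shuffle(expanded, r)


rng = np.random.default_rng(0)
lr_bev = rng.standard_normal((8, 4, 4))      # fused LR BEV feature (C, H, W)
weight = rng.standard_normal((8 * 2 * 2, 8))  # random, untrained projection
hr_bev = restore(lr_bev, weight, r=2)
print(hr_bev.shape)  # (8, 8, 8)
```

The spatial grid doubles along each axis while the channel count is unchanged, so the same BEV scope is covered at a finer cell size.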
Abstract
Recent advances in Bird's Eye View (BEV) fusion for map construction have demonstrated remarkable mapping of urban environments. However, their deep and bulky architectures incur substantial backpropagation memory and computing latency. This poses an unavoidable bottleneck in constructing high-resolution (HR) BEV maps, because their large features significantly increase costs such as GPU memory consumption and computing latency; we call this the diverging training costs issue. As a result, most existing methods adopt low-resolution (LR) BEV and struggle to estimate the precise locations of urban scene components such as road lanes and sidewalks. Because this imprecision makes motion planning tasks such as collision avoidance risky, the diverging training costs issue has to be resolved. In this paper, we address the issue with our novel BEVRestore mechanism. Specifically, our proposed model encodes the features of each sensor into LR BEV space and restores them to HR space to establish a memory-efficient map constructor. To this end, we introduce the BEV restoration strategy, which repairs the aliasing and blocky artifacts of the up-scaled BEV features and narrows the width of the labels. Our extensive experiments show that the proposed mechanism provides a plug-and-play, memory-efficient pipeline, enabling HR map construction with a broad BEV scope.
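To make the cost growth concrete, the following sketch counts the bytes held by a single dense BEV feature map as the cell size shrinks. All numbers (a 100 m scope, 128 channels, 4-byte values, cell sizes of 0.5 m and 0.125 m) are hypothetical and are not taken from the paper; the function name `bev_feature_bytes` is likewise an assumption. It only illustrates that, for a fixed scope, the number of cells grows with the square of the resolution factor.

```python
def bev_feature_bytes(scope_m, cell_m, channels, bytes_per_value=4):
    """Bytes in one dense square BEV feature map covering scope_m x scope_m."""
    cells_per_side = round(scope_m / cell_m)
    return cells_per_side * cells_per_side * channels * bytes_per_value


# Hypothetical settings: same scope, 4x finer cells along each axis.
lr = bev_feature_bytes(scope_m=100.0, cell_m=0.5, channels=128)
hr = bev_feature_bytes(scope_m=100.0, cell_m=0.125, channels=128)
print(hr // lr)  # 16
```

Under these assumptions, every layer that operates in HR BEV space stores 16 times more activations for backpropagation than its LR counterpart, which is the motivation for learning in LR space and restoring resolution only at the end.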
