Improved Single Camera BEV Perception Using Multi-Camera Training
Daniel Busch, Ido Freeman, Richard Meyes, Tobias Meisen
TL;DR
This work tackles BEV perception for autonomous driving under a cost-constrained setting by enabling strong single-camera inference through training-time multi-camera strategies. It combines BEVFormer with inverse block masking, a cyclic learning rate schedule, and a BEV feature reconstruction loss to supervise the transition from six-camera training to one-camera inference. The integrated approach yields notable gains on nuScenes, with approximately a 20% increase in NDS, 25% in mAP, and 19% in mIoU over baselines, while reducing hallucinations in the BEV map. The results suggest that carefully designed training-time modalities can substantially close the gap between surround-view performance and single-front-camera perception, enabling cheaper production sensors without sacrificing BEV quality.
Abstract
Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks such as trajectory prediction. In the past, this was accomplished with a sophisticated sensor configuration that captured a surround view from multiple cameras. In large-scale production, however, cost efficiency is an optimization goal, which makes using fewer cameras more relevant. Fewer input images, in turn, come with a drop in performance. This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup. The cost restriction applies primarily at inference time on production cars; it is less problematic on a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround view model reduced for single-camera inference. The approach includes three features: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. For single-camera inference, our method outperforms versions trained strictly with one camera or strictly with six-camera surround view, resulting in reduced hallucination and better quality of the BEV map.
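To make two of the three features more concrete, the following is a minimal sketch of a cyclic LR schedule and a BEV feature reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the triangular shape of the cycle, the mean-squared-error form of the loss, the weighting factor, and all names (`triangular_lr`, `bev_reconstruction_loss`, `total_loss`, `recon_weight`) are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclic learning rate (assumed shape, not from the paper).

    The rate rises linearly from base_lr to max_lr over half_cycle steps,
    falls back to base_lr over the next half_cycle steps, then repeats.
    """
    cycle_pos = step % (2 * half_cycle)
    # 0.0 at the start and end of a cycle, 1.0 at the peak in the middle.
    frac = 1.0 - abs(cycle_pos / half_cycle - 1.0)
    return base_lr + (max_lr - base_lr) * frac


def bev_reconstruction_loss(bev_single, bev_multi):
    """Mean squared error between two BEV feature maps of equal shape.

    bev_single: features computed from the single-camera input.
    bev_multi:  features computed from the six-camera input, used here
                as the reconstruction target (an assumption).
    """
    if bev_single.shape != bev_multi.shape:
        raise ValueError("BEV feature maps must have the same shape")
    return float(np.mean((bev_single - bev_multi) ** 2))


def total_loss(task_loss, bev_single, bev_multi, recon_weight=1.0):
    """Task loss plus a weighted reconstruction term (hypothetical weighting)."""
    return task_loss + recon_weight * bev_reconstruction_loss(bev_single, bev_multi)


if __name__ == "__main__":
    # Learning rate at a few points of one cycle of 200 steps.
    for step in (0, 50, 100, 150, 200):
        print(step, triangular_lr(step, base_lr=1e-5, max_lr=1e-3, half_cycle=100))

    # Toy BEV feature maps with shape (channels, height, width).
    rng = np.random.default_rng(0)
    multi = rng.normal(size=(8, 16, 16))
    single = multi + 0.1 * rng.normal(size=multi.shape)
    print(total_loss(task_loss=0.5, bev_single=single, bev_multi=multi, recon_weight=0.5))
```

In this sketch the six-camera features act as a fixed target for the single-camera features, so the reconstruction term vanishes only when both inputs produce the same BEV representation. In a PyTorch training loop, a scheduler such as `torch.optim.lr_scheduler.CyclicLR` could take the place of the hand-written schedule.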
