Improved Single Camera BEV Perception Using Multi-Camera Training

Daniel Busch, Ido Freeman, Richard Meyes, Tobias Meisen

TL;DR

This work tackles BEV perception for autonomous driving under a cost-constrained setting by enabling strong single-camera inference through training-time multi-camera strategies. It combines BEVFormer with inverse block masking, a cyclic learning rate schedule, and a BEV feature reconstruction loss to supervise the transition from six-camera training to one-camera inference. The integrated approach yields notable gains on nuScenes, with approximately a 20% increase in NDS, 25% in mAP, and 19% in mIoU over baselines, while reducing hallucinations in the BEV map. The results suggest that carefully designed training-time modalities can substantially close the gap between surround-view performance and single-front-camera perception, enabling cheaper production sensors without sacrificing BEV quality.

Abstract

Bird's Eye View (BEV) map prediction is essential for downstream autonomous driving tasks such as trajectory prediction. In the past, this was accomplished with a sophisticated sensor configuration that captured a surround view from multiple cameras. However, in large-scale production, cost efficiency is an optimization goal, which makes using fewer cameras more relevant. Fewer input images, however, come with a performance drop. This raises the problem of developing a BEV perception model that provides sufficient performance on a low-cost sensor setup. Although this cost restriction is primarily relevant at inference time on production cars, it is less problematic on a test vehicle during training. Therefore, the objective of our approach is to reduce the aforementioned performance drop as much as possible using a modern multi-camera surround-view model reduced for single-camera inference. The approach includes three features: a modern masking technique, a cyclic Learning Rate (LR) schedule, and a feature reconstruction loss for supervising the transition from six-camera inputs to one-camera input during training. Our method outperforms versions trained strictly with one camera or strictly with six-camera surround view for single-camera inference, resulting in reduced hallucination and better quality of the BEV map.
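
The feature reconstruction idea mentioned above can be illustrated with a minimal sketch. It assumes that the BEV features from the unmasked six-camera pass and from the masked pass are compared with a mean squared error, and that this term is added to the task loss with a scalar weight. The distance function, the function names, and the `weight` parameter are illustrative assumptions, not details taken from the paper.

```python
def feature_reconstruction_loss(reference_bev, masked_bev):
    """Mean squared error between two BEV feature grids.

    Both inputs are lists of cells, each cell a list of channel values.
    reference_bev: features from the unmasked six-camera pass (assumed target).
    masked_bev:    features from the pass with masked side and rear views.
    MSE is an assumption; the source does not name the distance function.
    """
    if len(reference_bev) != len(masked_bev):
        raise ValueError("BEV grids must have the same number of cells")
    total, count = 0.0, 0
    for ref_cell, msk_cell in zip(reference_bev, masked_bev):
        if len(ref_cell) != len(msk_cell):
            raise ValueError("BEV cells must have the same number of channels")
        for ref, msk in zip(ref_cell, msk_cell):
            total += (ref - msk) ** 2
            count += 1
    return total / count


def combined_loss(task_loss, reconstruction_loss, weight=0.5):
    """Task loss plus weighted reconstruction term (weight is hypothetical)."""
    return task_loss + weight * reconstruction_loss


# Toy example: two BEV cells with two channels each.
reference = [[1.0, 2.0], [3.0, 4.0]]
masked = [[1.0, 2.0], [3.0, 2.0]]
recon = feature_reconstruction_loss(reference, masked)
total = combined_loss(task_loss=0.5, reconstruction_loss=recon)
```

In a real training loop the reference features would typically be treated as a fixed target, so that only the masked pass is pulled toward them; whether the authors do this is not stated here.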

Paper Structure (22 sections, 5 figures, 3 tables)

This paper contains 22 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: BEVFormer architecture (Li et al., 2022) extended with the feature reconstruction method. Left: First-step input and second-step input with noise masking. Middle: Backbone and Transformer layers with Temporal Self-Attention into History BEV and Spatial Cross-Attention with re-projection into the 2D features from the backbone. Additionally, the Feature Reconstruction loss over the BEV feature embeddings from the first and second steps. Right: Heads and output samples.
  • Figure 2: Cyclic LR schedule (blue) and mean for masking ratio (green) over the training epochs. The masking ratio refers only to the five non-front-facing cameras.
  • Figure 3: Sample of the inverse block masking with a masking ratio of $\mu=0.4$ and variance $\sigma=0.2$. The front view (blue frame) is not masked.
  • Figure 4: Results for one sample from two baselines (the first trained on one camera, the second trained on six cameras) and from our method. Inference for all runs is done on one camera. Left: The GT segmentation map. Center: The predicted BEV map with projected bounding boxes (GT=green; prediction=blue; masked view=grey).
  • Figure 5: BEV maps and two visualized channels of the latent space BEV feature representation from a six-camera training baseline (a) with six-camera inference and (b) with single-camera inference. In addition, map and features from our training method (c) with six-camera inference and (d) with single-camera inference. The grey cover indicates the masked views. The warmer the colors, the higher the values.
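
The masking described for Figures 2 and 3 can be sketched as follows. This is a rough illustration under several assumptions: the masking ratio is drawn per camera from a Gaussian with $\mu=0.4$ and $\sigma=0.2$ (treating $\sigma$ as a standard deviation) and clipped to $[0, 1]$; inverse block masking is implemented by starting from a fully masked patch grid and carving out square blocks to keep; the grid size and block size are hypothetical. The camera names are the nuScenes channel names, and the front view is never masked.

```python
import random

CAMERAS = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
           "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"]


def sample_mask_ratio(rng, mu=0.4, sigma=0.2):
    """Draw a masking ratio from a Gaussian and clip it to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(mu, sigma)))


def inverse_block_mask(rng, rows, cols, ratio, block=4):
    """Return a rows x cols grid where True marks a masked patch.

    Starts fully masked and un-masks randomly placed square blocks until
    exactly round(ratio * rows * cols) patches remain masked.
    Block size and stopping rule are assumptions for illustration.
    """
    masked = [[True] * cols for _ in range(rows)]
    target = int(round(ratio * rows * cols))
    n_masked = rows * cols
    while n_masked > target:
        top = rng.randrange(0, max(1, rows - block + 1))
        left = rng.randrange(0, max(1, cols - block + 1))
        for r in range(top, min(rows, top + block)):
            for c in range(left, min(cols, left + block)):
                if masked[r][c] and n_masked > target:
                    masked[r][c] = False
                    n_masked -= 1
    return masked


def mask_surround_views(rng, rows=8, cols=16, mu=0.4, sigma=0.2):
    """Build one patch mask per camera; the front camera stays unmasked."""
    masks = {}
    for cam in CAMERAS:
        if cam == "CAM_FRONT":
            masks[cam] = [[False] * cols for _ in range(rows)]
        else:
            ratio = sample_mask_ratio(rng, mu, sigma)
            masks[cam] = inverse_block_mask(rng, rows, cols, ratio)
    return masks


rng = random.Random(0)  # fixed seed so the example is reproducible
masks = mask_surround_views(rng)
```

Raising `mu` over the course of training, as the green curve in Figure 2 suggests, would move the five non-front views from fully visible toward fully masked, which approaches the single-camera inference setting.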