Lift-Attend-Splat: Bird's-eye-view camera-lidar fusion using transformers

James Gunn, Zygmunt Lenyk, Anuj Sharma, Andrea Donati, Alexandru Buburuzan, John Redford, Romain Mueller

TL;DR

This paper interrogates the role of monocular depth in camera-lidar fusion for 3D object detection and shows that depth prediction from monocular cues is not essential when lidar is available. It introduces Lift-Attend-Splat, a depth-free fusion method that projects camera features into BEV via a lightweight transformer attention mechanism with lidar context, allowing camera information to influence multiple BEV locations. Across nuScenes, the method outperforms Lift-Splat baselines and competes with state-of-the-art fusion approaches, with additional gains from temporal feature aggregation and test-time ensembling. The work suggests reframing multimodal fusion away from depth-centric projections toward attention-driven fusion, potentially enabling simpler, more robust perception pipelines and informing future camera-only or radar-augmented extensions.

Abstract

Combining complementary sensor modalities is crucial to providing robust perception for safety-critical robotics applications such as autonomous driving (AD). Recent state-of-the-art camera-lidar fusion methods for AD rely on monocular depth estimation, which is a notoriously difficult task compared to using depth information from the lidar directly. Here, we find that this approach does not leverage depth as expected and show that naively improving depth estimation does not lead to improvements in object detection performance. Strikingly, we also find that removing depth estimation altogether does not degrade object detection performance substantially, suggesting that relying on monocular depth could be an unnecessary architectural bottleneck during camera-lidar fusion. In this work, we introduce a novel fusion method that bypasses monocular depth estimation altogether and instead selects and fuses camera and lidar features in a bird's-eye-view grid using a simple attention mechanism. We show that our model can modulate its use of camera features based on the availability of lidar features and that it yields better 3D object detection on the nuScenes dataset than baselines relying on monocular depth estimation.
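
To make the idea of "selecting" camera features with attention concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single cross-attention head in which lidar-derived features along one camera ray act as queries and the camera features in the matching image column act as keys and values. The function name `column_cross_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical choices for illustration; the paper itself describes an encoder-decoder transformer.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def column_cross_attention(lidar_queries, cam_column, w_q, w_k, w_v):
    """Single-head cross-attention for one image column (illustrative only).

    lidar_queries: (R, C) lidar-derived features at R range bins along a ray.
    cam_column:    (H, C) camera features at H pixel rows of the same column.
    Returns the fused features (R, D) and the attention weights (R, H).
    """
    q = lidar_queries @ w_q                    # (R, D) queries from lidar
    k = cam_column @ w_k                       # (H, D) keys from camera
    v = cam_column @ w_v                       # (H, D) values from camera
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (R, H) scaled dot products
    weights = softmax(scores, axis=-1)         # each range bin distributes over rows
    return weights @ v, weights


# Hypothetical sizes: 6 range bins, 8 image rows, 16 channels.
rng = np.random.default_rng(0)
R, H, C, D = 6, 8, 16, 16
lidar_queries = rng.normal(size=(R, C))
cam_column = rng.normal(size=(H, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) / np.sqrt(C) for _ in range(3))

fused, weights = column_cross_attention(lidar_queries, cam_column, w_q, w_k, w_v)
```

In this sketch no depth is predicted for any pixel: each range bin decides, through its attention weights, which image rows contribute to it, so one camera feature can influence several BEV locations.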

Paper Structure (32 sections, 8 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Impact of the quality of the monocular depth prediction on the object detection performance of BEVFusion [liu2022bevfusion] on the nuScenes validation set. We compare BEVFusion and four different variants: adding depth supervision using \ref{eq:depth sup loss} with various weights $\lambda$, using lidar depth maps instead of monocular depth estimation (lidar), using a pretrained and frozen depth classifier (pretrained), and finally removing depth estimation altogether by projecting camera features at all depths uniformly using \ref{eq:swiftblat} (uniform depth). In our experiments, more accurate depth does not translate to better detection performance, and the original model is on par with using the lidar points directly as a source of depth. The uniform depth model achieves equivalent detection performance, clearly indicating that accurate monocular depth is not necessary for BEVFusion [liu2022bevfusion] to achieve its performance; see the main text and \ref{sec:suppl:depth results} for details.
  • Figure 2: Lift-Attend-Splat camera-lidar fusion architecture. (left) Overall architecture: features from the camera and lidar backbones are fused together and merged before being passed to a detection head. (inset) Geometry of our 3D projection: the "Lift" step embeds the lidar BEV features into the projected horizon by lifting the lidar features along the z-direction using bilinear sampling. The "Splat" step corresponds to the inverse transformation in that it projects features from the projected horizon back onto the BEV grid using bilinear sampling, again along the z-direction. (right) Details of the projection module: the "Attend" step in our method lets the lifted lidar features $\tilde{B}^{\text{lid}}_i$ attend to the camera features $F^{\text{cam}}_i$ in the corresponding column using a simple encoder-decoder transformer architecture to produce fused features $D(\tilde{B}_i^{\text{lid}}, E(F^{\text{cam}}_i))$ in frustum space.
  • Figure 3: Object detection performance measured using mAP for objects at different distances from the ego and of different sizes. Our model consistently outperforms baselines based on Lift-Splat, especially at large distances and for small objects.
  • Figure 4: (a, b) Visualisation of where camera features of ground-truth objects are projected onto the BEV grid for our method compared to BEVFusion [liu2022bevfusion]. We observe that our method is able to place camera features around objects more narrowly than BEVFusion, which is based on monocular depth estimation. (c) Comparison of saliency maps, cropped to aid visualisation, given the camera image (top) for models trained with camera-lidar (middle) and camera only (bottom). When trained with both camera and lidar, our model selects camera features in a different area from the one it selects when trained with camera only, while BEVFusion [liu2022bevfusion] behaves similarly in both settings.
  • Figure S1: Depth maps obtained after different levels of depth supervision on an example from the nuScenes val set.
  • ...and 4 more figures
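
The "uniform depth" variant in the Figure 1 caption can be illustrated with a small sketch of the Lift step used by Lift-Splat style methods, where each pixel's feature vector is spread over depth bins by an outer product with a per-pixel depth distribution. This is an assumption-laden illustration, not the paper's equation: the helper names `lift` and `uniform_depth`, the shapes, and the choice to normalise the uniform weights to `1 / D` are all hypothetical.

```python
import numpy as np


def lift(cam_feats, depth_probs):
    """Outer product of per-pixel depth weights and per-pixel features.

    cam_feats:   (H, W, C) camera feature map.
    depth_probs: (H, W, D) weights over D depth bins for every pixel.
    Returns a frustum of features with shape (H, W, D, C).
    """
    return depth_probs[..., :, None] * cam_feats[..., None, :]


def uniform_depth(h, w, d):
    # Assumed form of the depth-free baseline: every depth bin gets weight 1/D.
    return np.full((h, w, d), 1.0 / d)


# Hypothetical sizes: 2x3 feature map, 4 channels, 5 depth bins.
rng = np.random.default_rng(0)
H, W, C, D = 2, 3, 4, 5
cam_feats = rng.normal(size=(H, W, C))

# Stand-in for a monocular depth head: a softmax over random logits.
logits = rng.normal(size=(H, W, D))
predicted = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

frustum_pred = lift(cam_feats, predicted)               # depth-weighted lift
frustum_unif = lift(cam_feats, uniform_depth(H, W, D))  # depth-free lift
```

With uniform weights, every depth bin along a ray receives an identical copy of the scaled pixel feature, so the image carries no information about where along the ray an object lies; any localisation must then come from the lidar branch.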