Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking

Markus Käppeler, Özgün Çiçek, Daniele Cattaneo, Claudius Gläser, Yakov Miron, Abhinav Valada

TL;DR

This work tackles camera-only 3D object detection and multi-object tracking by combining perspective-view (PV) and bird's-eye-view (BEV) representations. It introduces DualViewDistill, a hybrid framework that fuses PV features with BEV maps enriched by distilling DINOv2 foundation-model features, producing dense BEV priors and improving association via deformable aggregation. A key contribution is offline pseudo-label generation paired with a cosine-similarity distillation loss, which trains the BEV encoder to capture semantic and geometric priors without requiring LiDAR at inference. The method achieves state-of-the-art performance on nuScenes and Argoverse 2, remains robust across weather conditions, and highlights the practical value of online BEV priors for reliable perception in autonomous driving.

Abstract

Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird's-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .
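
The cosine-similarity distillation loss mentioned in the TL;DR can be illustrated with a minimal sketch. The source only states that a cosine-similarity loss distills DINOv2 features into the BEV representation; the exact form used here (one minus cosine similarity, averaged over cells that carry a pseudo-label) and the names `cosine_distillation_loss`, `student`, `teacher`, and `valid` are assumptions for illustration, not the authors' implementation.

```python
import math


def cosine_distillation_loss(student, teacher, valid, eps=1e-8):
    """Mean (1 - cosine similarity) over BEV cells that have a pseudo-label.

    student: per-cell feature vectors predicted by the BEV encoder.
    teacher: per-cell pseudo-label vectors (e.g. projected DINOv2 features).
    valid:   per-cell booleans, True where a pseudo-label exists.
    All three are plain Python lists of equal length (hypothetical layout).
    """
    total, count = 0.0, 0
    for s, t, ok in zip(student, teacher, valid):
        if not ok:
            continue  # cells without a pseudo-label do not contribute
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        # eps guards against division by zero for all-zero vectors
        total += 1.0 - dot / max(norm_s * norm_t, eps)
        count += 1
    return total / count if count else 0.0


# Toy example: three BEV cells with 2-D features; the last cell is masked out.
student = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher = [[2.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
valid = [True, True, False]
loss = cosine_distillation_loss(student, teacher, valid)
```

Because cosine similarity ignores vector magnitude, a loss of this form only asks the student to match the direction of the teacher features, which is one plausible reason to prefer it over an L2 penalty when the two feature spaces have different scales.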

Paper Structure

This paper contains 25 sections, 6 equations, 8 figures, 12 tables.

Figures (8)

  • Figure 1: DualViewDistill (c) integrates both perspective-view (PV) features and latent bird’s-eye-view (BEV) features guided by DINOv2 to improve 3D object detection and tracking, whereas previous methods rely only on either BEV (a) or PV (b) features.
  • Figure 2: Previous SOTA methods vs. DualViewDistill on the nuScenes 3D Multi-Object Tracking Benchmark. DualViewDistill achieves superior performance across all key tracking metrics, notably improving AMOTA and reducing ID switches on the nuScenes test set.
  • Figure 3: Overview of our proposed DualViewDistill approach. We jointly leverage PV and BEV camera features, while enriching the BEV representation through DINOv2-guided distillation. Pseudo-labels for distillation are generated by projecting DINOv2 features into BEV space via the LiDAR point cloud before training. Both components jointly improve 3D object detection and tracking.
  • Figure 4: Detection and Tracking Transformer Head. Object queries consisting of anchors and instance features interact via deformable aggregation with PV and BEV features. An instance memory is used for temporal fusion and to propagate queries for tracking.
  • Figure 5: Visualization of 3D object detection results from our proposed DualViewDistill with ViT-L backbone on the nuScenes test set. Classes are color-coded as follows: car, truck, construction vehicle, bus, trailer, barrier, motorcycle, bicycle, pedestrian, traffic cone.
  • ...and 3 more figures
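
The pseudo-label generation described in the Figure 3 caption (projecting DINOv2 features into BEV space via the LiDAR point cloud before training) can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the function name `build_bev_pseudo_labels`, the `project` callable standing in for camera calibration, the nested-list feature map, and the choice to average features per BEV cell are all hypothetical and are not taken from the paper.

```python
def build_bev_pseudo_labels(points, project, feat_map, bev_range, cell_size):
    """Scatter image features into a BEV grid using LiDAR points.

    points:    list of (x, y, z) LiDAR points in the ego frame, in meters.
    project:   callable (x, y, z) -> (u, v) integer pixel, or None if the
               point is not visible (stand-in for real camera calibration).
    feat_map:  feat_map[v][u] is the feature vector at pixel (u, v),
               e.g. an upsampled DINOv2 feature map (assumed layout).
    bev_range: (x_min, y_min, x_max, y_max) extent of the BEV grid.
    cell_size: side length of one BEV cell in meters.
    Returns a dict mapping (ix, iy) cell indices to the mean feature vector.
    """
    x_min, y_min, x_max, y_max = bev_range
    sums, counts = {}, {}
    for x, y, z in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # point lies outside the BEV grid
        uv = project(x, y, z)
        if uv is None:
            continue  # point is not visible in the image
        u, v = uv
        feat = feat_map[v][u]
        cell = (int((x - x_min) // cell_size), int((y - y_min) // cell_size))
        if cell not in sums:
            sums[cell] = [0.0] * len(feat)
            counts[cell] = 0
        sums[cell] = [a + b for a, b in zip(sums[cell], feat)]
        counts[cell] += 1
    # Average all features that fell into the same cell
    return {c: [a / counts[c] for a in s] for c, s in sums.items()}


# Toy setup: a 2x2 "image" with 2-D features and a fake projection.
feat_map = [[[1.0, 0.0], [0.0, 1.0]],
            [[2.0, 2.0], [4.0, 4.0]]]


def toy_project(x, y, z):
    u, v = int(x), int(y)
    return (u, v) if 0 <= u < 2 and 0 <= v < 2 else None


points = [(0.2, 0.2, 0.0), (1.2, 0.3, 1.0), (3.0, 3.0, 0.0), (5.0, 0.0, 0.0)]
labels = build_bev_pseudo_labels(points, toy_project, feat_map,
                                 bev_range=(0.0, 0.0, 4.0, 4.0), cell_size=2.0)
```

Cells that receive no LiDAR point are simply absent from the result, so a distillation loss would need to skip them. Since LiDAR is used only in this offline step, the trained model can run on cameras alone at inference, as the TL;DR notes.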