
Robust Bird's Eye View Segmentation by Adapting DINOv2

Merve Rabia Barın, Görkay Aydemir, Fatma Güney

TL;DR

This work adapts a large vision foundation model, DINOv2, to BEV segmentation using Low-Rank Adaptation (LoRA). It builds on DINOv2's strong representation space by integrating the adapted backbone into a state-of-the-art BEV framework, SimpleBEV.

Abstract

Extracting a Bird's Eye View (BEV) representation from multiple camera images offers a cost-effective, scalable alternative to LIDAR-based solutions in autonomous driving. However, the performance of the existing BEV methods drops significantly under various corruptions such as brightness and weather changes or camera failures. To improve the robustness of BEV perception, we propose to adapt a large vision foundational model, DINOv2, to BEV estimation using Low Rank Adaptation (LoRA). Our approach builds on the strong representation space of DINOv2 by adapting it to the BEV task in a state-of-the-art framework, SimpleBEV. Our experiments show increased robustness of BEV perception under various corruptions, with increasing gains from scaling up the model and the input resolution. We also showcase the effectiveness of the adapted representations in terms of fewer learnable parameters and faster convergence during training.

Paper Structure

This paper contains 12 sections, 3 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Robustness Analysis on nuScenes-C. We compare the models under different types of corruptions, where each axis is normalized by the best-performing model, i.e. the ViT-L adaptation. We also show the performance drop of each model relative to its performance on clean data, where each axis is normalized to the clean-data performance of the corresponding model.
  • Figure 2: Overview. In this work, we propose to adapt DINOv2 to BEV segmentation using Low-Rank Adaptation (LoRA) for a robust BEV model. There are three main steps: i) We encode the camera images using DINOv2 to obtain tokens for each view, with the attention weights updated through LoRA. ii) We transform the image features from 2D to 3D using the pull mechanism proposed by Harley et al. (ICRA 2023). iii) We decode the BEV features into 2D vehicle BEV masks.
  • Figure 2: Training Method and Parameter Efficiency. This table shows the results of different weight update strategies with the corresponding number of learnable parameters. Note that there are additional 5M parameters for the decoder.
  • Figure 3: Varying the Rank of LoRA. This plot illustrates the effect of increasing the LoRA rank (in log scale) on the performance, with rank 0 representing a frozen backbone.
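
The captions above refer to updating attention weights through LoRA and to rank 0 meaning a frozen backbone. As a rough illustration of that idea, and not the authors' implementation, the sketch below applies a generic low-rank update to a single linear projection in plain Python. The function names, the tiny matrix sizes, the `scale` factor, and the 1024-dimensional projection used for the parameter count are all assumptions made for this example. The summary does not state which attention projections are adapted or which ranks are used.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_count(d_in, d_out, rank):
    """Learnable parameters of one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)


def lora_forward(x, w0, a, b, scale=1.0):
    """Compute x @ w0 + scale * (x @ a) @ b; w0 stays frozen, only a and b would be trained."""
    base = matmul(x, w0)
    if len(b) == 0:  # rank 0: no adapter, i.e. a frozen projection
        return base
    update = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(rb, ru)] for rb, ru in zip(base, update)]


# Hypothetical toy sizes, chosen only to keep the example small.
rng = random.Random(0)
n_tokens, d_in, d_out, rank = 3, 4, 4, 2
x = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(n_tokens)]
w0 = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
a = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d_in)]
b_zero = [[0.0] * d_out for _ in range(rank)]  # common LoRA init: B starts at zero

frozen_out = lora_forward(x, w0, [[] for _ in range(d_in)], [])
adapted_out = lora_forward(x, w0, a, b_zero)

# Parameter comparison for one assumed 1024 x 1024 projection at rank 32.
full_params = 1024 * 1024
adapter_params = lora_param_count(1024, 1024, 32)
print(adapted_out == frozen_out, full_params, adapter_params)
```

With B initialized to zero, the adapted projection starts out identical to the frozen one, so training begins from the pretrained representation. For the assumed 1024 x 1024 projection, the rank-32 adapter has 65,536 learnable parameters against 1,048,576 for a full update of the same matrix.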