
First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation

Tommie Kerssies, Daan de Geus, Gijs Dubbelman

TL;DR

This work demonstrates that fine-tuning Vision Foundation Models (VFMs) with a simple segmentation decoder on Cityscapes yields state-of-the-art robustness for semantic segmentation under diverse, out-of-distribution conditions. By evaluating multiple configurations, notably DINOv2 with ViT backbones and a linear decoder, the authors show that pretraining diversity and end-to-end fine-tuning can outperform complex specialist models across six BRAVO subsets. The results reveal nuanced relationships between semantic accuracy and confidence calibration, with strong OOD detection sometimes arising from decoder choices like Mask2Former. Overall, the study highlights the practical potential of VFMs for robust, well-calibrated semantic segmentation in real-world, degraded environments, while identifying avenues for deeper analysis of decoder effects and calibration strategies.

Abstract

In this report, we present the first place solution to the ECCV 2024 BRAVO Challenge, in which a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches and achieves first place in the challenge. Our code is publicly available at https://github.com/tue-mps/benchmark-vfm-ss.

Paper Structure (14 sections, 1 equation, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Our meta-approach. We take a pre-trained Vision Foundation Model (VFM), attach a simple segmentation decoder, and fine-tune the entire model for semantic segmentation. The segmentation decoder outputs both the per-pixel classification predictions and the associated confidence scores.
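
The output stage described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes a linear decoder applied to per-patch features and takes the confidence score to be the maximum softmax probability, which the text above does not specify. The function names `linear_decoder`, `softmax`, and `predict_with_confidence` are hypothetical, and the toy features stand in for real VFM outputs.

```python
import math


def linear_decoder(patch_features, weights, biases):
    """Map each feature vector to per-class logits with a single linear layer.

    patch_features: list of feature vectors (stand-ins for VFM patch tokens)
    weights: one weight vector per class; biases: one bias per class
    """
    return [
        [sum(w * x for w, x in zip(class_w, feat)) + b
         for class_w, b in zip(weights, biases)]
        for feat in patch_features
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits_per_pixel):
    """Return (predicted class, confidence) lists, one entry per pixel.

    Assumption: confidence is the maximum softmax probability.
    Ties are resolved in favor of the lowest class index.
    """
    preds, confs = [], []
    for logits in logits_per_pixel:
        probs = softmax(logits)
        conf = max(probs)
        preds.append(probs.index(conf))
        confs.append(conf)
    return preds, confs


# Toy data: three "pixels" with 2-dimensional features and 3 classes.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
biases = [0.0, 0.0, 0.0]

logits = linear_decoder(features, weights, biases)
preds, confs = predict_with_confidence(logits)
```

In this toy example the third pixel lies between two classes, so its confidence is lower than that of the first two. An ambiguous or out-of-distribution input would ideally produce this kind of lower score.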