First Place Solution to the ECCV 2024 BRAVO Challenge: Evaluating Robustness of Vision Foundation Models for Semantic Segmentation
Tommie Kerssies, Daan de Geus, Gijs Dubbelman
TL;DR
This work shows that fine-tuning a Vision Foundation Model (VFM) with a simple segmentation decoder on Cityscapes yields state-of-the-art robustness for semantic segmentation under diverse out-of-distribution (OOD) conditions. By evaluating multiple configurations, notably DINOv2 ViT backbones with a linear decoder, the authors show that VFMs with diverse pretraining, fine-tuned end to end, can outperform complex specialist models across the six BRAVO subsets. The results reveal a nuanced relationship between semantic accuracy and confidence calibration, with strong OOD detection sometimes arising from decoder choices such as Mask2Former. Overall, the study highlights the practical potential of VFMs for robust, well-calibrated semantic segmentation in degraded real-world environments, and identifies decoder effects and calibration strategies as avenues for deeper analysis.
Abstract
In this report, we present the first-place solution to the ECCV 2024 BRAVO Challenge, in which a model is trained on Cityscapes and its robustness is evaluated on several out-of-distribution datasets. Our solution leverages the powerful representations learned by vision foundation models by attaching a simple segmentation decoder to DINOv2 and fine-tuning the entire model. This approach outperforms more complex existing approaches and achieves first place in the challenge. Our code is publicly available at https://github.com/tue-mps/benchmark-vfm-ss.
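The recipe described above (a simple decoder on top of a ViT backbone, with every parameter fine-tuned) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation, which is in the linked repository. `LinearSegmenter` and `ToyPatchEncoder` are hypothetical names, and the toy encoder merely stands in for a pretrained DINOv2 backbone so that the example runs without downloading weights. The patch size of 14 matches DINOv2 and the 19 classes match the standard Cityscapes evaluation setup, but the embedding width, input size, and learning rate are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained ViT: a patch embedding only."""

    def __init__(self, embed_dim: int, patch_size: int):
        super().__init__()
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, C, H/p, W/p) -> (B, N, C) patch tokens
        return self.proj(x).flatten(2).transpose(1, 2)


class LinearSegmenter(nn.Module):
    """Backbone that returns patch tokens, followed by a per-patch linear classifier."""

    def __init__(self, backbone: nn.Module, embed_dim: int, patch_size: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # assumed to return tokens of shape (B, N, C)
        self.patch_size = patch_size
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        grid_h, grid_w = h // self.patch_size, w // self.patch_size
        logits = self.head(self.backbone(images))                    # (B, N, K)
        logits = logits.transpose(1, 2).reshape(b, -1, grid_h, grid_w)  # (B, K, H/p, W/p)
        # Upsample patch-level logits back to the input resolution.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)


torch.manual_seed(0)
model = LinearSegmenter(ToyPatchEncoder(embed_dim=32, patch_size=14),
                        embed_dim=32, patch_size=14, num_classes=19)

# Fine-tuning the entire model: the optimizer receives backbone and head parameters alike.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lr is an assumed value

images = torch.randn(2, 3, 56, 56)            # random stand-in for Cityscapes crops
labels = torch.randint(0, 19, (2, 56, 56))    # random stand-in for label maps

logits = model(images)
loss = F.cross_entropy(logits, labels, ignore_index=255)  # 255 marks unlabeled pixels
loss.backward()
optimizer.step()
```

In practice the toy encoder would be replaced by a pretrained DINOv2 ViT that exposes its patch tokens. The point of the sketch is only that nothing is frozen: gradients from the segmentation loss reach the backbone as well as the head.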
