How to Benchmark Vision Foundation Models for Semantic Segmentation?

Tommie Kerssies; Daan de Geus; Gijs Dubbelman

TL;DR

The paper addresses the lack of a standardized benchmark for evaluating vision foundation models (VFMs) on semantic segmentation. It proposes a benchmarking setup that varies one setting at a time (encoder freezing, decoder, model size, patch size, training dataset, domain shift), with mean IoU as the main metric, ADE20K as the default dataset, and Kendall's tau to measure how much each change alters the performance ranking. The main findings are that end-to-end fine-tuning of ViT-B/16 with a linear decoder is a representative default, that linear probing is not representative of end-to-end fine-tuning, and that Mask2Former largely preserves the ranking at a higher training cost. In the resulting analysis, masked image modeling (MIM) with abstract representations emerges as a crucial pretraining objective, more important than the type of supervision used. In practice, the paper offers guidance for efficient, multi-dataset benchmarking and a public codebase for comparing VFMs on semantic segmentation and for re-running the benchmark as new models appear.
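
To make the main metric concrete, the following is a minimal sketch of mean IoU computed from flattened label lists. It is not taken from the paper's codebase: the function name `mean_iou`, the `ignore_index` value of 255, and the choice to average only over classes that occur in the prediction or ground truth are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over flattened per-pixel class labels.

    Assumptions (not from the paper): pixels whose ground-truth label equals
    `ignore_index` are skipped, and classes that never occur in either the
    prediction or the ground truth are left out of the average.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # A correct pixel counts once in both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with two classes; the last pixel is ignored.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 255]
print(mean_iou(pred, target, num_classes=2))
```

In a real evaluation the intersection and union counts would typically be accumulated over the whole validation set before dividing, rather than averaged per image.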

Abstract

Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.
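
As a rough illustration of why patch size and decoder choice affect training cost, the sketch below counts the patch tokens a ViT produces and the parameters of a per-token linear classification head. The 512-pixel square input and the 8-pixel alternative patch size are assumptions chosen for the example, not settings confirmed by this summary; the embedding width of 768 for ViT-B and the 150 classes of ADE20K are standard values. The helper names are hypothetical.

```python
def patch_grid(image_size, patch_size):
    """Return (tokens per side, total tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


def linear_head_params(embed_dim, num_classes):
    """Weights plus biases of one linear layer applied to every patch token."""
    return embed_dim * num_classes + num_classes


IMAGE_SIZE = 512   # assumed crop size, for illustration only
EMBED_DIM = 768    # ViT-B token width
NUM_CLASSES = 150  # ADE20K semantic classes

for patch_size in (16, 8):
    side, tokens = patch_grid(IMAGE_SIZE, patch_size)
    print(f"patch {patch_size}x{patch_size}: {side}x{side} grid, {tokens} tokens")

print("linear head parameters:", linear_head_params(EMBED_DIM, NUM_CLASSES))
```

Halving the patch size quadruples the number of tokens, and self-attention cost grows quadratically with the token count, which is one reason a smaller patch size is more expensive to train.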

Paper Structure (13 sections, 1 equation, 9 figures, 3 tables)

Introduction
Related work
Benchmarking setup
Models
Settings
Evaluation metrics
Implementation details
Results
Impact of settings
Analysis of model performance
Conclusion
Discussion
Acknowledgements

Figures (9)

Figure 1: Performance ranking impact of settings. Kendall's $\tau$ is used to assess ranking similarity between VFMs under default settings (linear decoder, ViT-B, $16\times16$ patch size, ADE20K, end-to-end fine-tuning) and after changing individual settings, ranging from -1 for a reverse ranking to 1 for an identical ranking.
Figure 2: Default setup results. End-to-end fine-tuning with a linear decoder of the ViT-B variants with a $16\times16$ patch size on ADE20K.
Figure 3: Linear probing results. Freezing the encoder results in a correlation coefficient of 0.47 and reduces training time by 0.6 times compared to end-to-end fine-tuning (blue dots).
Figure 4: Mask2Former results. Using the Mask2Former decoder results in a correlation coefficient of 0.87 and increases training time by 4.1 times compared to a linear decoder (blue dots).
Figure 5: ViT-L results. Using the ViT-L counterparts results in a correlation coefficient of 0.87 and increases training time by 1.8 times compared to the ViT-B variants (blue dots).
...and 4 more figures
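
The ranking comparison described in the Figure 1 caption can be sketched as follows. This is a plain-Python version of Kendall's tau without tie correction (tau-a); the summary does not say which variant the authors use, so that choice is an assumption. The model names and scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists for the same models.

    Returns 1.0 for an identical ranking and -1.0 for a fully reversed one.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Placeholder mIoU values for four hypothetical models under two settings.
models = ["model_a", "model_b", "model_c", "model_d"]
default_setting = [40.0, 45.0, 50.0, 55.0]
changed_setting = [41.0, 47.0, 56.0, 52.0]  # the top two models swap places
print(kendall_tau(default_setting, changed_setting))
```

A value close to 1 indicates that the changed setting ranks the models in nearly the same order as the default setting, so the cheaper default is representative of it.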
