FisherTune: Fisher-Guided Robust Tuning of Vision Foundation Models for Domain Generalized Segmentation

Dong Zhao, Jinlong Li, Shuang Wang, Mengyao Wu, Qi Zang, Nicu Sebe, Zhun Zhong

TL;DR

FisherTune is a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM), which measures parameter sensitivity across tasks and domains. It uses this measure to update parameters selectively, preserving generalization while improving adaptability for Domain Generalized Semantic Segmentation (DGSS).

Abstract

Vision Foundation Models (VFMs) excel in generalization due to large-scale pretraining, but fine-tuning them for Domain Generalized Semantic Segmentation (DGSS) while maintaining this ability remains challenging. Existing approaches either selectively fine-tune parameters or freeze the VFMs and update only the adapters, both of which may underutilize the VFMs' full potential in DGSS tasks. We observe that domain-sensitive parameters in VFMs, arising from task and distribution differences, can hinder generalization. To address this, we propose FisherTune, a robust fine-tuning method guided by the Domain-Related Fisher Information Matrix (DR-FIM). DR-FIM measures parameter sensitivity across tasks and domains, enabling selective updates that preserve generalization and enhance DGSS adaptability. FisherTune incorporates variational inference to stabilize DR-FIM estimation, treating parameters as Gaussian-distributed variables and leveraging pre-trained priors. Extensive experiments show that FisherTune achieves superior cross-domain segmentation while maintaining generalization, outperforming selective-parameter and adapter-based methods.
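
As a rough illustration of the idea of Fisher-guided selective updates, the sketch below estimates a diagonal empirical Fisher information for a toy logistic-regression model on a source set and on a shifted copy of it, combines the two into a per-parameter sensitivity score, and selects the top-k parameters to update. Everything here is an assumption for illustration: the toy model, the function names (`diagonal_fisher`, `domain_sensitivity`, `top_k_mask`), and especially the way the two Fisher estimates are combined. The abstract does not give the DR-FIM formula, and the variational estimation step is omitted.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def diagonal_fisher(weights, inputs, labels):
    """Empirical diagonal Fisher: mean squared per-sample gradient
    of the log-likelihood with respect to each weight."""
    fisher = [0.0] * len(weights)
    for x, y in zip(inputs, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        for i, xi in enumerate(x):
            # For logistic regression, d/dw_i log p(y | x) = (y - p) * x_i.
            g = (y - p) * xi
            fisher[i] += g * g
    return [f / len(inputs) for f in fisher]


def domain_sensitivity(fisher_source, fisher_shifted):
    # Hypothetical score (NOT the paper's DR-FIM): task sensitivity on the
    # source domain plus the absolute change in Fisher under domain shift.
    return [fs + abs(fs - ft) for fs, ft in zip(fisher_source, fisher_shifted)]


def top_k_mask(scores, k):
    # True marks a parameter selected for updating; the rest stay frozen.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [i in chosen for i in range(len(scores))]


if __name__ == "__main__":
    random.seed(0)
    weights = [0.5, -0.3, 0.8]
    source = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
    labels = [1 if x[0] > 0 else 0 for x in source]
    # Hypothetical domain shift: the third feature is rescaled.
    shifted = [[x[0], x[1], 3.0 * x[2]] for x in source]
    scores = domain_sensitivity(
        diagonal_fisher(weights, source, labels),
        diagonal_fisher(weights, shifted, labels),
    )
    print(top_k_mask(scores, k=1))
```

In a real VFM the per-sample gradients would come from automatic differentiation over the segmentation loss, and the resulting mask would gate which parameters the optimizer updates.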

Paper Structure

This paper contains 15 sections, 16 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of principles for different VFM adjustment methods: (a) tuning by adapter insertion [Wei_2024_CVPR, yi2024learning], (b) tuning by manually selected [tu2023visual] or automatically selected parameters [sung2021trainingneuralnetworksfixed], (c) our method for tuning domain-sensitive parameters.
  • Figure 2: Comparison of average performance across multiple VFMs in DGSS experiments on GTA → Cityscapes + BDD100K + Mapillary using different fine-tuning methods, including adapter-based Rein [Wei_2024_CVPR], manually selected parameter-based VQT [tu2023visual], adaptively selected parameter-based ChildTune [xu2021raisechildlargelanguage], and our FisherTune.
  • Figure 3: Effect of fine-tuning different VFM layers in DGSS experiments with DINOv2-Large on GTA → Cityscapes + BDD100K + Mapillary. Fine-tuning different layers affects the generalization performance of the VFM differently. B denotes blocks.
  • Figure 4: Comparison of FIM and DR-FIM under different degrees of domain shift. The size of the circle indicates the value. It shows that DR-FIM is a generalization of FIM as it additionally considers the cross-domain sensitivity of parameters.
  • Figure 5: Ablation study of estimation methods on Cityscapes → BDD100K (C2B) and → ACDC (C2A), and on GTAV → Cityscapes (G2C), → BDD100K (G2B), and → Mapillary (G2M).
  • ...and 2 more figures