Causal-Tune: Mining Causal Factors from Vision Foundation Models for Domain Generalized Semantic Segmentation

Yin Zhang, Yongqiang Zhang, Yaoyue Zheng, Bogdan Raducanu, Dan Liu

TL;DR

This work tackles domain generalization in semantic segmentation with Vision Foundation Models (VFMs) by addressing the artifacts that arise from long-term pretraining. It introduces Causal-Tune, which transforms each layer's features into the frequency domain with a Discrete Cosine Transform (DCT), splits the spectrum into causal and non-causal components with a Gaussian band-pass filter, discards the non-causal part, and refines the causal part with learnable tokens in the frequency domain before converting back to the spatial domain. The approach achieves state-of-the-art or competitive results across multiple cross-domain benchmarks, with notable gains in adverse weather (e.g., Snow) and real-to-real transfers, and it includes ablations and visualizations that support the causal-factor hypothesis. The result is a practical, plug-in fine-tuning strategy that improves DGSS robustness for VFMs without full model re-training.

Abstract

Fine-tuning Vision Foundation Models (VFMs) with a small number of parameters has shown remarkable performance in Domain Generalized Semantic Segmentation (DGSS). Most existing works either train lightweight adapters or refine intermediate features to achieve better generalization on unseen domains. However, they both overlook the fact that long-term pre-trained VFMs often exhibit artifacts, which hinder the utilization of valuable representations and ultimately degrade DGSS performance. Inspired by causal mechanisms, we observe that these artifacts are associated with non-causal factors, which usually reside in the low- and high-frequency components of the VFM spectrum. In this paper, we explicitly examine the causal and non-causal factors of features within VFMs for DGSS, and propose a simple yet effective method to identify and disentangle them, enabling more robust domain generalization. Specifically, we propose Causal-Tune, a novel fine-tuning strategy designed to extract causal factors and suppress non-causal ones from the features of VFMs. First, we extract the frequency spectrum of features from each layer using the Discrete Cosine Transform (DCT). A Gaussian band-pass filter is then applied to separate the spectrum into causal and non-causal components. To further refine the causal components, we introduce a set of causal-aware learnable tokens that operate in the frequency domain, while the non-causal components are discarded. Finally, refined features are transformed back into the spatial domain via inverse DCT and passed to the next layer. Extensive experiments conducted on various cross-domain tasks demonstrate the effectiveness of Causal-Tune. In particular, our method achieves superior performance under adverse weather conditions, improving +4.8% mIoU over the baseline in snow conditions.
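
The frequency-domain split described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it operates on a 1-D signal in pure Python, and the function names (`dct`, `idct`, `gaussian_band_pass`, `split_causal`), the difference-of-Gaussians form of the band-pass mask, and the cutoff values are all assumptions made for the example.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def gaussian_band_pass(n, low_cut, high_cut):
    """Assumed mask: difference of two Gaussian low-pass responses.

    It is 0 at frequency index 0, peaks between the two cutoffs,
    and decays toward 0 again at high frequency indices.
    """
    return [
        math.exp(-k * k / (2.0 * high_cut ** 2))
        - math.exp(-k * k / (2.0 * low_cut ** 2))
        for k in range(n)
    ]


def split_causal(feature, low_cut, high_cut):
    """Split the DCT spectrum into a band-passed part and its complement."""
    spectrum = dct(feature)
    mask = gaussian_band_pass(len(feature), low_cut, high_cut)
    causal = [m * s for m, s in zip(mask, spectrum)]
    non_causal = [(1.0 - m) * s for m, s in zip(mask, spectrum)]
    return causal, non_causal


# Hypothetical usage: keep the mid-band part and return to the spatial domain.
feature = [0.5, -1.0, 2.0, 3.5, 0.0, 1.25, -0.75, 4.0]
causal, non_causal = split_causal(feature, low_cut=1.0, high_cut=4.0)
filtered_feature = idct(causal)  # the non-causal part is simply dropped
```

Because the assumed mask is zero at the DC coefficient, a constant input has an empty causal part in this sketch, and `idct(dct(x))` recovers `x` up to floating-point error. Applying the transform to real VFM features would require a choice of axes and cutoffs that this excerpt does not specify.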

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Visualization of DINOv2 feature maps. (a) Features extracted from the frozen DINOv2 contain noticeable artifacts. (b) Artifacts persist after applying existing adapter-based fine-tuning methods. (c) Our proposed Causal-Tune effectively suppresses these artifacts and guides the model to focus on domain-invariant causal factors.
  • Figure 2: Left: Causal factors and non-causal factors (the latter comprising explicit and implicit non-causal factors). Right: Visualization of images to which various non-causal factors have been deliberately added, shown after DCT, high- and low-frequency filtering (H&LF), and inverse DCT.
  • Figure 3: The pipeline of our proposed Causal-Tune. (a) The output feature $f_{i}$ of layer $L_{i}$ is first transformed into the frequency-domain feature $F^{DCT}_i$ using DCT, and a Gaussian band-pass filter then separates it into causal factors $F^{cau}_i$ (red) and non-causal factors $F^{n-cau}_i$ (green). Only the causal factors are used for subsequent fine-tuning, while the non-causal factors are discarded. (b) A series of causal-aware learnable tokens $T^{cau}_{i}$ interacts with the causal factors $F^{cau}_i$ through an attention mechanism to refine them. The refined causal factors $\hat{F}^{cau}_i$ are then transformed back to the spatial domain as $\hat{f}^{cau}_i$ using the inverse DCT (iDCT).
  • Figure 4: Accuracy matrix of the cutoff frequency analysis under $C. \rightarrow$ ACDC generalization. The horizontal and vertical axes are the low- and high-cutoff frequencies, respectively.
  • Figure 5: Feature maps from different VFMs. We show the feature maps of frozen DINOv2, EVA02, and CLIP, and the feature maps after fine-tuning with our method.
  • ...and 1 more figure
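
The token refinement in Figure 3(b) is described only as an attention interaction between the learnable tokens $T^{cau}_{i}$ and the causal factors $F^{cau}_i$. One way such an interaction could look is sketched below in pure Python. This is an assumption-laden illustration rather than the paper's formulation: the function name `refine_with_tokens`, the scaled dot-product scoring, the softmax over tokens, and the residual update are all choices made for the example, and the tokens are plain lists here rather than trained parameters.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def refine_with_tokens(features, tokens):
    """Refine n x d causal coefficients with m x d tokens (assumed scheme).

    Each feature row attends over the tokens, and the attention-weighted
    token mixture is added back to the row as a residual.
    """
    d = len(tokens[0])
    refined = []
    for f in features:
        # Scaled dot-product similarity between this row and every token.
        scores = [sum(a * b for a, b in zip(f, t)) / math.sqrt(d) for t in tokens]
        weights = softmax(scores)
        # Weighted sum of tokens, one value per channel.
        delta = [sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)]
        refined.append([a + b for a, b in zip(f, delta)])
    return refined


# Hypothetical usage with two 2-channel coefficient rows and three tokens.
causal_factors = [[1.0, 2.0], [0.0, -1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
refined_factors = refine_with_tokens(causal_factors, tokens)
```

With all-zero tokens the residual vanishes and the features pass through unchanged, which makes this form a convenient identity-at-initialization starting point. Whether Causal-Tune initializes or combines its tokens this way is not stated in this excerpt.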