SegDebias: Test-Time Bias Mitigation for ViT-Based CLIP via Segmentation

Fangyu Wu, Yujun Cai

TL;DR

SegDebias tackles spurious correlations in ViT-based CLIP with a test-time, segmentation-guided debiasing pipeline that requires no bias annotations or retraining. The pipeline selects a target attribute, obtains a segmentation mask, neutralizes non-target regions through a constrained perturbation, and reconstructs the image for zero-shot inference, which reduces background-driven bias while preserving the target signal. Empirical results on Waterbirds and CelebA show improved worst-group accuracy and smaller performance gaps, together with higher Attention-IoU, indicating better semantic alignment. The approach needs no training data or group labels, which makes it a scalable route to annotation-free bias mitigation in vision-language systems.

Abstract

Vision-language models such as CLIP have shown remarkable performance in zero-shot classification, but remain susceptible to spurious correlations, where irrelevant visual features influence predictions. Existing debiasing methods often require access to training data and explicit group labels to perform fine-tuning or adjust embeddings, which limits their practicality in real-world settings. Test-time methods attempt to avoid this constraint, but many still depend on prior knowledge of dataset-specific biases, limiting their generalizability in open-set settings. In this work, we propose a test-time debiasing method for ViT-based CLIP models that requires neither additional training nor bias annotations. Our approach uses a pretrained segmentation model to isolate the target visual attribute, then adjusts the non-target regions so that their embeddings are uniformly similar to all class-specific text prompts. This procedure removes unintended bias signals from confounding visual regions while preserving the target attribute. Experiments on Waterbirds and CelebA show that our method outperforms existing test-time debiasing approaches in both group robustness metrics and Attention-IoU. These results demonstrate the effectiveness of segmentation-guided interventions for scalable and annotation-free bias mitigation in vision-language models.
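
The abstract's requirement that non-target embeddings be "uniformly similar to all class specific text prompts" can be illustrated with a small sketch. The paper's exact objective is not given in this section, so the variance-based loss below is an assumption chosen for illustration, and the names `cosine_similarities` and `uniformity_loss` are hypothetical. The embeddings are plain NumPy vectors standing in for CLIP outputs.

```python
import numpy as np

def cosine_similarities(image_emb, text_embs):
    """Cosine similarity between one image embedding (d,) and K text embeddings (K, d)."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return text_embs @ image_emb

def uniformity_loss(image_emb, text_embs):
    """Assumed objective: variance of the K similarities.

    The value is zero when the embedding is equally similar to every
    class prompt, and grows as it favors some prompts over others.
    """
    sims = cosine_similarities(image_emb, text_embs)
    return float(np.mean((sims - sims.mean()) ** 2))

# Toy stand-ins for two class prompts (e.g. "waterbird", "landbird").
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
biased_region = np.array([1.0, 0.2])   # leans toward the first prompt
neutral_region = np.array([1.0, 1.0])  # equally similar to both prompts

loss_biased = uniformity_loss(biased_region, text_embs)
loss_neutral = uniformity_loss(neutral_region, text_embs)
```

In a test-time setting, a loss of this kind would be minimized with respect to the non-target pixels only, so the target region selected by the segmentation mask is left unchanged.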

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Attention-weight feature maps (Grad-CAM) from a ViT-based CLIP model on two binary classification tasks: waterbird vs. landbird (panels a–b) and blond-hair vs. dark-hair (panels c–d). Panels (a) and (c) show mixed attribution, where both target and non-target regions are emphasized; panels (b) and (d) illustrate failure modes where attention is directed exclusively to irrelevant features. Colors from red to green indicate increasing attention-map values, while blue areas represent minimal or no focus.
  • Figure 2: Correlation between cosine similarity differences for non-target regions (x-axis: $\Delta_{\text{non-target}}$) and full images (y-axis: $\Delta_{\text{full}}$) with respect to two candidate text embeddings. Each point corresponds to an image randomly sampled from the dataset (1,500 total), illustrating how much the non-target region alone biases the prediction in the same direction as the full image across the Waterbirds and CelebA datasets.
  • Figure 3: Overview of the proposed debiasing pipeline for zero-shot image classification. Given an input image and associated candidate text embeddings, we identify the target attribute (e.g., bird), segment it from the image, and optimize the background (non-target attributes) to have equal cosine distances with all text embeddings. The target attribute is then repainted into the debiased background and passed to CLIP for final zero-shot prediction.
  • Figure 4: Comparison of attention maps before (left) and after (right) applying SegDebias on two tasks. Top row: three pairs of waterbird examples; bottom row: three pairs of CelebA hair examples. In each pair, vanilla CLIP’s attention (left) tends to scatter over confounding background or depiction biases, whereas SegDebias’s attention (right) aligns more closely with the true semantic region, yielding a more interpretable focus.
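
Two quantities from the captions above are easy to make concrete: the cosine similarity difference plotted in Figure 2 and the repainting step in Figure 3. The sketch below is an illustration under assumptions rather than the authors' implementation. The function names `similarity_gap` and `repaint_target` are hypothetical, embeddings are plain NumPy vectors standing in for CLIP outputs, and the mask is assumed to be binary with 1 marking the target attribute.

```python
import numpy as np

def similarity_gap(image_emb, text_emb_a, text_emb_b):
    """Delta = cos(image, text_a) - cos(image, text_b).

    Computed on a full image this corresponds to Delta_full; computed on
    the non-target region alone it corresponds to Delta_non-target.
    """
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos(image_emb, text_emb_a) - cos(image_emb, text_emb_b)

def repaint_target(image, debiased_background, mask):
    """Paste target pixels (mask == 1) over the debiased background.

    image, debiased_background: float arrays of shape (H, W, C)
    mask: array of shape (H, W) with values in {0, 1} (assumed binary)
    """
    m = mask[..., None].astype(image.dtype)  # broadcast over channels
    return m * image + (1.0 - m) * debiased_background

# Toy example: a 2x2 RGB image of ones over a background of zeros.
image = np.ones((2, 2, 3))
background = np.zeros((2, 2, 3))
mask = np.array([[1, 0], [0, 1]])
composite = repaint_target(image, background, mask)

gap = similarity_gap(np.array([1.0, 0.0]),
                     np.array([1.0, 0.0]),
                     np.array([0.0, 1.0]))
```

A gap near zero for the non-target region means the background no longer favors either candidate prompt, which is the condition the background optimization in Figure 3 aims for before the composite is passed to CLIP.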