RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz

TL;DR

RaVL first discovers spurious correlations with a region-level clustering approach that identifies the precise image features contributing to zero-shot classification errors. It then mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning.

Abstract

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
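
The abstract reports mitigation results as worst-group image classification accuracy. As a rough illustration of that metric in its standard form (accuracy computed separately per group, then the minimum taken), here is a minimal Python sketch. The function name, the group labels, and the toy data are hypothetical and not taken from the paper; the paper's exact group definition is not given in this summary, so the sketch assumes each sample carries a group identifier such as a (class, spurious feature present) pair.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst per-group accuracy, dict of per-group accuracies).

    Assumption: `groups` assigns each sample to one group, e.g. a
    (class, spurious-feature-present) combination.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(preds, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Hypothetical toy data: the model fails on nines shown without the rectangle.
preds = [1, 1, 0, 0, 1, 0]
labels = [1, 1, 0, 1, 1, 0]
groups = [
    "nine+rect", "nine+rect", "other+no_rect",
    "nine+no_rect", "nine+no_rect", "other+no_rect",
]

worst, per_group = worst_group_accuracy(preds, labels, groups)
print(worst, per_group)
```

In this toy example overall accuracy is 5/6, while the worst group sits at 0.5, which is why the worst-group figure is the more informative number when a spurious correlation is suspected.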

Paper Structure

This paper contains 20 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Region-aware Vision-Language learning (RaVL). RaVL takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features.
  • Figure 2: RaVL accurately identifies spurious correlations. Using our evaluation settings, we show that RaVL consistently outperforms prior methods in discovering learned spurious correlations between image features and textual attributes. Here, we provide Precision@10 metrics for a CLIP-RN50 model fine-tuned on synthetic data (129 settings) and real-world data (171 settings).
  • Figure 3: RaVL surfaces spurious correlations in off-the-shelf VLMs. RaVL identifies a spurious correlation learned by CLIP ViT-B/16 between the presence of text-based retail signage and the class label fast food restaurant in a scene classification task. RaVL also surfaces a spurious correlation learned by PubMedCLIP ResNet-50 between metal clips (found in clothing) and the class label cardiomegaly (a heart condition) on a chest X-ray classification task.
  • Figure 4: Example evaluation settings. Here, we provide examples of predefined spurious correlations, fine-tuning datasets, and evaluation datasets associated with a synthetic evaluation setting (top row) and a real-world evaluation setting (bottom row). The example synthetic evaluation setting consists of a predefined spurious correlation between a red rectangle (spurious image feature $\mathbf{e}^{eval}$) and nine (textual attribute $a^{eval}$). This spurious correlation is visible in the vision-language fine-tuning dataset, where red rectangles and nines are strongly correlated, but not in the evaluation dataset. Similarly, the example real-world evaluation setting consists of a predefined spurious correlation between a person (spurious image feature $\mathbf{e}^{eval}$) and couch (textual attribute $a^{eval}$). Again, this spurious correlation is visible in the vision-language fine-tuning dataset, where people and couches are strongly correlated, but not in the evaluation dataset.
  • Figure 5: RaVL accurately identifies spurious correlations. Here, we provide an extended version of Figure 2, which demonstrates that RaVL consistently outperforms prior methods in discovering learned spurious correlations between image features and textual attributes. We report Precision@10 metrics for a CLIP-RN50 model fine-tuned on synthetic data (129 settings) and real-world data (171 settings); a CLIP-RN101 model fine-tuned on synthetic data (162 settings) and real-world data (192 settings); and an average across both model architectures.
  • ...and 1 more figure
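
The captions for Figures 2 and 5 report Precision@10. As a generic illustration of Precision@K (the fraction of the top K ranked items that are relevant), a minimal Python sketch follows. It assumes a ranked list of boolean flags, where each flag marks whether a surfaced image region contains the predefined spurious feature; that input format, the function name, and the example flags are assumptions for illustration, since the paper's ranking procedure is not described in this summary.

```python
def precision_at_k(ranked_is_relevant, k=10):
    """Fraction of the top-k ranked items that are flagged relevant.

    Assumption: `ranked_is_relevant` is ordered best-first, and each entry
    is truthy when the item (e.g. an image region) contains the feature
    of interest. Lists shorter than k are still divided by k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_is_relevant[:k]
    return sum(1 for flag in top_k if flag) / k


# Hypothetical ranking: 8 of the top 10 regions contain the spurious feature.
flags = [True, True, False, True, True, True, False, True, True, True]
score = precision_at_k(flags, k=10)
print(score)
```

Dividing by k rather than by the number of returned items penalizes a method that surfaces fewer than k candidates, which is one common convention; other conventions exist.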