PRSM: A Measure to Evaluate CLIP's Robustness Against Paraphrases

Udo Schlegel, Franziska Weeber, Jian Lan, Thomas Seidl

TL;DR

This paper tackles the problem of CLIP's robustness to paraphrase, a critical concern for reliable and fair multimodal retrieval. It introduces the Paraphrase Ranking Stability Metric (PRSM), which jointly assesses global ranking stability via Spearman correlation and local retrieval stability via top-k overlap, applied to three paraphrase strategies on the Social Counterfactuals dataset. Empirical results show very low global stability (<0.04) across conditions, with moderate and paraphrase-type-dependent local stability, and small but consistent gender-related differences, suggesting potential bias amplification under paraphrase. The work highlights the need for paraphrase-invariant training and bias-aware evaluation to ensure fair and dependable deployment of vision-language systems like CLIP.

Abstract

Contrastive Language-Image Pre-training (CLIP) is a widely used multimodal model that aligns text and image representations through large-scale training. While it performs strongly on zero-shot and few-shot tasks, its robustness to linguistic variation, particularly paraphrasing, remains underexplored. Paraphrase robustness is essential for reliable deployment, especially in socially sensitive contexts where inconsistent representations can amplify demographic biases. In this paper, we introduce the Paraphrase Ranking Stability Metric (PRSM), a novel measure for quantifying CLIP's sensitivity to paraphrased queries. Using the Social Counterfactuals dataset, a benchmark designed to reveal social and demographic biases, we empirically assess CLIP's stability under paraphrastic variation, examine the interaction between paraphrase robustness and gender, and discuss implications for fairness and equitable deployment of multimodal systems. Our analysis reveals that robustness varies across paraphrasing strategies, with subtle yet consistent differences observed between male- and female-associated queries.
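The two ingredients that the summary attributes to PRSM, a Spearman rank correlation over the full ranking and a top-k overlap over the retrieved head of the ranking, can be illustrated with a minimal sketch. The function names, the tie handling, the convention of returning 0.0 for a constant ranking, and the normalisation of the overlap by k are all assumptions made for illustration. The sketch also keeps the two components separate, because how the paper combines them is not stated here. It takes similarity scores for the same set of images under an original query and under a paraphrase.

```python
from typing import Sequence


def average_ranks(scores: Sequence[float]) -> list[float]:
    """Rank scores in descending order (rank 1 = highest); ties share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next item has the same score (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # 1-based mean rank of the tied run
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


def spearman(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    ranks_a, ranks_b = average_ranks(scores_a), average_ranks(scores_b)
    n = len(ranks_a)
    mean_a, mean_b = sum(ranks_a) / n, sum(ranks_b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean_a) ** 2 for x in ranks_a)
    var_b = sum((y - mean_b) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant ranking; 0.0 is an assumed convention
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(scores_a: Sequence[float], scores_b: Sequence[float], k: int) -> float:
    """Fraction of the top-k items under scores_a that are also in the top-k under scores_b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have equal length")
    if not 0 < k <= len(scores_a):
        raise ValueError("k must be between 1 and the number of items")
    top_a = set(sorted(range(len(scores_a)), key=lambda i: -scores_a[i])[:k])
    top_b = set(sorted(range(len(scores_b)), key=lambda i: -scores_b[i])[:k])
    return len(top_a & top_b) / k


# Toy similarity scores for five images (made-up numbers, not results from the paper).
original = [0.9, 0.8, 0.7, 0.6, 0.5]
paraphrase = [0.8, 0.9, 0.7, 0.6, 0.5]  # the two best images swap places

global_stability = spearman(original, paraphrase)
local_stability = top_k_overlap(original, paraphrase, k=2)
```

In this toy case the swap barely affects the global correlation and leaves the top-2 set unchanged, while a top-1 overlap would drop to zero. That is the sense in which a global and a local measure can disagree about the same pair of rankings.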

Paper Structure

This paper contains 5 sections, 1 table.