Language-Guided Invariance Probing of Vision-Language Models

Jae Joong Lee

TL;DR

Language-Guided Invariance Probing (LGIP) is a targeted benchmark for the linguistic robustness of vision–language models. It measures invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips in the image–text similarity space. Holding the image fixed and applying paraphrase and flip perturbations to COCO captions, the method reports invariance error, semantic sensitivity, and positive-rate metrics, which allows comparison across architectures. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants exhibit a favorable invariance–sensitivity trade-off, while SigLIP-family models show high invariance error and often prefer flipped captions, revealing attribute-level failures that standard retrieval metrics hide. LGIP shows that strong zero-shot performance does not guarantee linguistic robustness, and it provides a lightweight, model-agnostic diagnostic that can guide robustness-aware development of vision–language systems.

Abstract

Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic. Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
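The three summary statistics named above can be illustrated with a minimal sketch. The exact formulas are not given in this summary, so the definitions below are assumptions: invariance error is taken as the mean absolute change in similarity under paraphrase, the sensitivity gap as the mean difference between original and flipped scores, and the positive rate as the fraction of flips for which the original caption scores higher. The function name `lgip_metrics` and its argument names are hypothetical.

```python
import numpy as np


def lgip_metrics(s_orig, s_para, s_flip):
    """Summarize image-text similarity scores for N image-caption pairs.

    s_orig: shape (N,)   score of each image with its original caption
    s_para: shape (N, P) scores with P paraphrases of that caption
    s_flip: shape (N, F) scores with F semantic flips of that caption

    All three definitions below are assumed for illustration and may
    differ from the formulas used in the paper.
    """
    s_orig = np.asarray(s_orig, dtype=float)
    s_para = np.asarray(s_para, dtype=float)
    s_flip = np.asarray(s_flip, dtype=float)

    # Assumed invariance error: mean absolute score shift under paraphrase.
    inv_error = np.mean(np.abs(s_para - s_orig[:, None]))

    # Per-flip margin of the original caption over the flipped caption.
    gap = s_orig[:, None] - s_flip

    # Assumed sensitivity gap: average margin (higher is better).
    sens_gap = np.mean(gap)

    # Assumed positive rate: share of flips where the original wins.
    pos_rate = np.mean(gap > 0)

    return {
        "invariance_error": float(inv_error),
        "sensitivity_gap": float(sens_gap),
        "positive_rate": float(pos_rate),
    }


if __name__ == "__main__":
    # Toy scores for two images, two paraphrases and one flip each.
    print(lgip_metrics(
        s_orig=[0.30, 0.20],
        s_para=[[0.28, 0.32], [0.20, 0.18]],
        s_flip=[[0.10], [0.25]],
    ))
```

Under these assumed definitions, a model that prefers flipped captions, as reported for the SigLIP family, would show a positive rate below 0.5 and a small or negative sensitivity gap.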

Paper Structure

This paper contains 23 sections, 8 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Given an image and its human caption, we generate meaning-preserving paraphrases and semantic flips that change the object, color, or count, then feed all variants into a frozen vision-language model to measure the invariance and sensitivity of its similarity scores.
  • Figure 2: Trade-off between linguistic invariance and semantic sensitivity on LGIP. Each point corresponds to a model, plotted by invariance error ($\mathcal{E}_{\text{inv}}$, lower is better) and semantic sensitivity ($\mathcal{E}_{\text{sens}}$, higher is better). EVA02-CLIP and large OpenCLIP models lie on a favorable frontier, while SigLIP-family models cluster in a region with high invariance error and low sensitivity.
  • Figure 3: Qualitative LGIP examples comparing EVA02-CLIP (E) and SigLIP (S) on object flips. In all four cases EVA02-CLIP assigns higher similarity to the original caption, while SigLIP prefers the flipped caption (scores in red), indicating a lack of semantic sensitivity to object substitutions.
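
The rule-based flips shown in Figure 1 can be pictured with a small sketch of a color flip. This is only an illustration of the general idea: the swap table `COLOR_SWAPS` and the function `flip_color` are hypothetical, and the paper's actual rule set for object, color, and count edits is not reproduced here.

```python
import re

# Hypothetical swap table; the benchmark's real rules are not given in this summary.
COLOR_SWAPS = {
    "red": "blue",
    "blue": "red",
    "black": "white",
    "white": "black",
    "green": "yellow",
    "yellow": "green",
}

# Match any listed color as a whole word, ignoring case.
_COLOR_RE = re.compile(r"\b(" + "|".join(COLOR_SWAPS) + r")\b", re.IGNORECASE)


def flip_color(caption):
    """Swap the first color word in the caption.

    Returns the flipped caption, or None when the caption contains no
    color word from the table (so no flip can be generated).
    The replacement is always lowercase.
    """
    match = _COLOR_RE.search(caption)
    if match is None:
        return None
    replacement = COLOR_SWAPS[match.group(1).lower()]
    return caption[:match.start()] + replacement + caption[match.end():]


if __name__ == "__main__":
    print(flip_color("A red bus parked on the street"))
    print(flip_color("A dog running on grass"))
```

Returning None for captions without a matching word keeps such captions out of the flip set rather than producing an unchanged "flip", which would otherwise distort any comparison between original and flipped scores.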