Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models

Shaeke Salman, Md Montasir Bin Shams, Xiuwen Liu, Lingjiong Zhu

TL;DR

The paper addresses a fundamental paradox: vision-language transformers often achieve near-perfect zero-shot accuracy yet struggle to generalize robustly beyond standard benchmarks. It introduces a gradient-descent embedding-matching framework to probe the local geometry of the embedding space and demonstrates, on Imagenette, that visually indistinguishable images can be embedded to other classes with high confidence, yielding 0% systematic accuracy. A linearization-based analysis shows that adding Gaussian noise induces a normal distribution in representation space, explaining why robustness degrades under perturbations and why zero-shot performance can mask vulnerabilities. The findings are shown to be model- and dataset-agnostic, highlighting the need for systematic evaluation of generalization in multimodal transformers and offering a practical approach to detect adversarial modifications via noise perturbations.
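The linearization argument can be written out in a few lines. This is a sketch in assumed notation: the symbols $f$ (image encoder), $J$ (its Jacobian at the input $x$), and $\sigma$ (noise scale) are not taken from this summary, and the result holds only to first order in the noise.

```latex
\begin{aligned}
f(x + \epsilon) &\approx f(x) + J\,\epsilon,
  \qquad J = \left.\frac{\partial f}{\partial x}\right|_{x}, \\
\epsilon \sim \mathcal{N}(0, \sigma^2 I)
  \;&\Longrightarrow\;
  f(x + \epsilon) \;\overset{\text{approx.}}{\sim}\;
  \mathcal{N}\!\left(f(x),\; \sigma^2 J J^{\top}\right).
\end{aligned}
```

Under this approximation, the singular values of $J$ (the quantity shown in Figure 3) set how far a given amount of input noise moves the embedding along each direction.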

Abstract

Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
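The gradient-descent embedding matching mentioned above can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a small stand-in encoder `f(x) = tanh(W x)` with random weights in place of a real vision-language model, and the names `encode`, `match_embedding`, and `demo`, as well as the step size and iteration count, are hypothetical choices made for illustration. It starts from one input and moves it by gradient descent until its embedding is close to that of a different target input.

```python
import numpy as np


def encode(W, x):
    """Toy stand-in encoder f(x) = tanh(W x); an assumption, not a real model."""
    return np.tanh(W @ x)


def match_embedding(W, x_orig, target_emb, lr=0.05, steps=5000):
    """Minimize ||f(x) - target_emb||^2 by plain gradient descent from x_orig."""
    x = x_orig.copy()
    for _ in range(steps):
        emb = encode(W, x)
        residual = emb - target_emb
        # Chain rule: d tanh(z)/dz = 1 - tanh(z)^2, with z = W x.
        grad = 2.0 * W.T @ ((1.0 - emb ** 2) * residual)
        x -= lr * grad
    return x


def demo(seed=0, d_in=64, d_out=8):
    """Run one matching experiment and return summary numbers."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / np.sqrt(d_in), size=(d_out, d_in))
    x_orig = rng.uniform(size=d_in)    # plays the role of the original image
    x_target = rng.uniform(size=d_in)  # plays the role of the target image
    target_emb = encode(W, x_target)

    x_new = match_embedding(W, x_orig, target_emb)

    loss_before = float(np.sum((encode(W, x_orig) - target_emb) ** 2))
    loss_after = float(np.sum((encode(W, x_new) - target_emb) ** 2))
    moved = float(np.linalg.norm(x_new - x_orig))     # size of the modification
    gap = float(np.linalg.norm(x_target - x_orig))    # distance between inputs
    return loss_before, loss_after, moved, gap


if __name__ == "__main__":
    before, after, moved, gap = demo()
    print(f"embedding loss: {before:.4f} -> {after:.6f}")
    print(f"input moved by {moved:.3f}; original-to-target distance is {gap:.3f}")
```

Because the toy encoder maps 64 inputs to 8 outputs, many inputs share the same embedding, so the optimized input can reach the target embedding while staying much closer to the original input than to the target. That is the toy analogue of the visually indistinguishable images described in the abstract.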

Paper Structure (14 sections, 8 equations, 30 figures, 2 tables)

Figures (30)

  • Figure 1: Typical examples from ImageNet obtained using the proposed framework. The visually indistinguishable images have different representations from each other, as shown in their low-dimensional projections. The arrow in each title ($original \rightarrow target$) denotes an image derived from the original by aligning its embedding with that of the target image using our method. The projections of the embedding-aligned images closely resemble those of the class they were aligned to. The matrix shows the classification outcomes from the multimodal ImageBind pretrained model, used directly with no modifications. Please refer to the supplementary materials for all nine embedding-aligned images, their projections, and the full $vision \times text$ matrix.
  • Figure 2: More examples where visually indistinguishable images have very different representations via embedding alignment and therefore very different classification outcomes as shown in the classification probabilities.
  • Figure 3: The singular values of the Jacobian matrix for the leftmost image in Fig. \ref{fig:overall}(a) (bottom) and for the original version of that image (top).
  • Figure 4: Distributions of pairwise cosine similarity in the Imagenette dataset for pairs from the same class (red) and pairs from two different classes (purple).
  • Figure 5: Confusion matrix of zero-shot classification over all images from the ten Imagenette classes. The overall accuracy is $99.38\%$.
  • ...and 25 more figures