Intriguing Differences Between Zero-Shot and Systematic Evaluations of Vision-Language Transformer Models

Shaeke Salman; Md Montasir Bin Shams; Xiuwen Liu; Lingjiong Zhu

TL;DR

The paper addresses a fundamental paradox: vision-language transformers often achieve near-perfect zero-shot accuracy yet struggle to generalize robustly beyond standard benchmarks. It introduces a gradient-descent embedding-matching framework to probe the local geometry of the embedding space and demonstrates, on Imagenette, that visually indistinguishable images can be embedded to other classes with high confidence, yielding 0% systematic accuracy. A linearization-based analysis shows that adding Gaussian noise induces a normal distribution in representation space, explaining why robustness degrades under perturbations and why zero-shot performance can mask vulnerabilities. The findings are shown to be model- and dataset-agnostic, highlighting the need for systematic evaluation of generalization in multimodal transformers and offering a practical approach to detect adversarial modifications via noise perturbations.
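
The linearization argument summarized above can be written out in a few lines. This is a sketch under assumed notation, not the paper's own statement: $f$ denotes the image encoder, $J(x)$ its Jacobian at the input $x$, and $\sigma$ the noise standard deviation; none of these symbols is taken from the source.

$$
\begin{aligned}
f(x + \epsilon) &\approx f(x) + J(x)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \sigma^2 I), \\
\mathbb{E}\left[f(x + \epsilon)\right] &\approx f(x), \\
\operatorname{Cov}\left[f(x + \epsilon)\right] &\approx \sigma^2\, J(x)\, J(x)^\top .
\end{aligned}
$$

Because an affine map of a Gaussian vector is again Gaussian, the perturbed embedding is approximately distributed as $\mathcal{N}\left(f(x), \sigma^2 J(x) J(x)^\top\right)$ whenever the first-order approximation holds. The singular values of $J(x)$ then determine how far the embedding spreads along each direction for a given noise level.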

Abstract

Transformer-based models have dominated natural language processing and other areas in the last few years due to their superior (zero-shot) performance on benchmark datasets. However, these models are poorly understood due to their complexity and size. While probing-based methods are widely used to understand specific properties, the structures of the representation space are not systematically characterized; consequently, it is unclear how such models generalize and overgeneralize to new inputs beyond datasets. In this paper, based on a new gradient descent optimization method, we are able to explore the embedding space of a commonly used vision-language model. Using the Imagenette dataset, we show that while the model achieves over 99% zero-shot classification performance, it fails systematic evaluations completely. Using a linear approximation, we provide a framework to explain the striking differences. We have also obtained similar results using a different model to support that our results are applicable to other transformer models with continuous inputs. We also propose a robust way to detect the modified images.
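
To make the idea of gradient-descent embedding matching concrete, the following is a minimal sketch on a toy problem. It is not the authors' implementation: the encoder `embed` (a fixed `tanh(W x)` map), the weights `W`, the squared-distance loss, the step size, and the example inputs are all hypothetical stand-ins chosen so that the code runs with the standard library only.

```python
import math

# Hypothetical toy encoder weights: maps a 4-D "image" to a 2-D embedding.
W = [[0.5, -0.2, 0.1, 0.3],
     [-0.3, 0.4, 0.2, -0.1]]


def embed(x):
    """Toy encoder f(x) = tanh(W x); a stand-in for a real vision encoder."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def matching_loss(x, target_emb):
    """Squared Euclidean distance between f(x) and the target embedding."""
    return sum((e - t) ** 2 for e, t in zip(embed(x), target_emb))


def match_embedding(x_start, target_emb, lr=0.2, steps=2000):
    """Run gradient descent on the input so that f(x) approaches target_emb."""
    x = list(x_start)
    for _ in range(steps):
        emb = embed(x)
        # Chain rule: dL/dx = W^T [2 * (f(x) - target) * (1 - f(x)^2)]
        delta = [2.0 * (e - t) * (1.0 - e * e) for e, t in zip(emb, target_emb)]
        grad = [sum(W[i][j] * delta[i] for i in range(len(W)))
                for j in range(len(x))]
        x = [xj - lr * gj for xj, gj in zip(x, grad)]
    return x


# Hypothetical "original" and "target" inputs.
x_original = [1.0, 0.0, -1.0, 0.5]
x_target = [-0.5, 1.0, 0.5, -1.0]
target_emb = embed(x_target)

x_matched = match_embedding(x_original, target_emb)
print("loss before:", matching_loss(x_original, target_emb))
print("loss after: ", matching_loss(x_matched, target_emb))
```

In this toy setting the optimized input ends up with nearly the same embedding as the target while remaining a different input, because the encoder discards the directions in which the two inputs still differ. The sketch does not constrain the size of the change to the input, so it says nothing about visual indistinguishability; that property is a finding of the paper on real images and is not reproduced here.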

Paper Structure (14 sections, 8 equations, 30 figures, 2 tables)

Introduction
Related Work
Preliminaries
Methodology
Embedding Matching Procedure
The Effects on Normal Distributions
Experiments
Datasets and Settings
Experimental Results
Discussion
Conclusion
Appendix
More on Vision Transformers
Additional Results

Figures (30)

Figure 1: Typical examples from ImageNet obtained using the proposed framework. The visually indistinguishable images have different representations, as shown by their low-dimensional projections. The arrow in each title ($original \rightarrow target$) denotes an image derived from the original by aligning its embedding with that of the target image using our method. The projections of embedding-aligned images closely resemble those of the class they were aligned to. The matrix shows the classification outcomes from the multimodal ImageBind pretrained model, used directly with no modifications. Please refer to the supplementary materials for all nine embedding-aligned images, projections, and the full $vision \times text$ matrix.
Figure 2: More examples where visually indistinguishable images have very different representations via embedding alignment and therefore very different classification outcomes as shown in the classification probabilities.
Figure 3: The singular values of the Jacobian matrix for the leftmost image in Fig. \ref{fig:overall}(a) (bottom) and for the original version of that image (top).
Figure 4: Comparison of the pairwise cosine similarity distribution for pairs from the same class (red) with that for pairs from two different classes (purple) in the Imagenette dataset.
Figure 5: Confusion matrix of zero-shot classification performance for all the images from all the ten classes from Imagenette. The overall accuracy is $99.38\%$ for all the images.
...and 25 more figures
