Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks

Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen, Jinyin Chen, Isao Echizen

Abstract

Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential for understanding their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios across diverse downstream tasks, a setting that has been overlooked in previous evaluation work. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of the selected VLMs on zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and that CLIP models' performance degrades significantly on natural language-induced adversarial examples. Additionally, we provide interpretability analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.
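For reference, the zero-shot classification protocol evaluated here can be sketched with the Hugging Face transformers CLIP API. This is a minimal illustration, not the authors' exact configuration: the checkpoint name, prompt template, class list, and input image are placeholder assumptions.

```python
# Minimal sketch of zero-shot image classification with CLIP.
# Checkpoint, prompt template, and class names are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "cat", "bird"]  # placeholder class names
prompts = [f"a photo of a {c}" for c in labels]

image = Image.open("example.jpg")  # hypothetical input image
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)
    # Image-text similarity logits, softmaxed over the candidate prompts.
    probs = outputs.logits_per_image.softmax(dim=-1)

print(labels[probs.argmax().item()])
```

Under this protocol, robustness is probed simply by swapping the clean input image for its natural adversarial counterpart (e.g., a typographic or ImageNet-A variant) while holding the prompts fixed.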

Figures (7)

  • Figure 1: Proposed evaluation framework. Vision-language models are evaluated against typographic attacks, ImageNet-A, and natural language-induced adversarial examples across multiple downstream tasks. The framework also supports interpretability analysis.
  • Figure 2: Classification performance across clean and natural adversarial datasets. IN: ImageNet. IN-A: ImageNet-A. IN-typo: ImageNet-typographic (construction sketched after this list). LangAdv: language-induced adversarial images.
  • Figure 3: Segmentation performance across clean and natural adversarial datasets. PC-typo: PhraseCut-typographic. IN-A: ImageNet-A. LangAdv: language-induced adversarial images.
  • Figure 4: Visual question answering performance across clean and natural adversarial datasets. IN-A: ImageNet-A. LangAdv: language-induced adversarial images.
  • Figure 5: Grad-CAM visualizations of vision-language models on different natural adversarial images. Quadrants: ImageNet-Typo, RTA100 (top); ImageNet-A, LangAdv (bottom). Columns: Original, CLIP, robust CLIP, BLIP2, SigLIP2. Heatmaps denote high (red) to low (blue) attention intensity.
  • ...and 2 more figures
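Several of the evaluated datasets (IN-typo, PC-typo, RTA100) rely on typographic attacks, in which misleading text is rendered directly onto the image. The sketch below shows one plausible way such images can be constructed with Pillow; the overlay style, font, and placement are illustrative assumptions, not the dataset authors' exact recipe.

```python
# Minimal sketch of a typographic attack: render a misleading class label
# onto an image before feeding it to the VLM. Font and placement are
# illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def add_typographic_attack(image_path: str, attack_text: str,
                           position=(10, 10), font_size=32) -> Image.Image:
    """Overlay a misleading text label onto an image."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    try:
        # Common font on Linux systems; fall back to Pillow's built-in font.
        font = ImageFont.truetype("DejaVuSans-Bold.ttf", font_size)
    except OSError:
        font = ImageFont.load_default()
    # White text on a black box for high contrast, as in typical typographic attacks.
    bbox = draw.textbbox(position, attack_text, font=font)
    draw.rectangle(bbox, fill="black")
    draw.text(position, attack_text, fill="white", font=font)
    return img

# Example: an image of a dog overlaid with the word "cat" (hypothetical files).
# attacked = add_typographic_attack("dog.jpg", "cat")
# attacked.save("dog_typo.jpg")
```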