Questioning the Stability of Visual Question Answering

Amir Rosenfeld, Neta Glazer, Ethan Fetaya

TL;DR

This work studies the reliability of Visual Language Models (VLMs) under benign, semantics-preserving perturbations of both the visual and the textual input. By formalizing a stability framework and evaluating it across multiple benchmarks and models (both open-source and closed-source), it reveals pervasive instability under small perturbations such as pixel shifts, rephrasings, and multilingual translations. The authors show that stability strongly correlates with correctness, and that the stability of smaller, open-source models can be used to predict the correctness of much larger closed-source models. They further analyze internal representations and demonstrate that stability patterns persist across models, underscoring a fundamental fragility in current VLMs and urging robustness evaluations that emphasize invariances beyond adversarial perturbations.

Abstract

Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
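The abstract's notion of sample-level stability can be illustrated with a small sketch. It treats a sample as stable when every perturbed variant of the image-question pair yields the same answer, and it measures answer spread with Shannon entropy. The paper's own entropy equation is not reproduced in this section, so the exact form used here (entropy in bits over the empirical answer distribution) is an assumption, as are the function names and the lowercase/strip answer normalization.

```python
import math
from collections import Counter


def _normalize(answer: str) -> str:
    # Assumed normalization: ignore case and surrounding whitespace.
    return answer.strip().lower()


def is_stable(answers: list[str]) -> bool:
    """Return True if all perturbed variants produced the same answer."""
    if not answers:
        raise ValueError("need at least one answer")
    return len({_normalize(a) for a in answers}) == 1


def answer_entropy(answers: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical answer distribution.

    Assumed form; the paper's definition may differ in base or weighting.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(_normalize(a) for a in answers)
    total = sum(counts.values())
    # Sum p * log2(1/p) over distinct answers; 0.0 when all answers agree.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical answers of one model to one question under four pixel shifts.
shifted_answers = ["yes", "yes", "no", "yes"]
print(is_stable(shifted_answers))
print(answer_entropy(shifted_answers))
```

An entropy of zero corresponds to a stable sample; larger values indicate that the model's answer is spread across several alternatives as the input is perturbed.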

Paper Structure

This paper contains 34 sections, 7 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: The very high sensitivity of VLMs to small, non-adversarial perturbations. (top) A shift by two pixels to the left (barely visible) changes the model's answer from "yes" to "no". (bottom) Change in answer w.r.t. other offsets.
  • Figure 2: Distribution of Answer Entropy (Eq. \ref{eq:entropy}) per perturbation type vs. the number of correctly answered samples.
  • Figure 3: Distribution of entropy of sample answers for different perturbation types.
  • Figure 4: Layer-wise differences between activations of perturbations which caused a change in answer vs those which did not.
  • Figure 5: Effect of image rotation on answers. All question types are affected, even questions that are not dependent on orientation (rotation invariant).
  • ...and 5 more figures