Questioning the Stability of Visual Question Answering

Amir Rosenfeld; Neta Glazer; Ethan Fetaya

TL;DR

This work studies the reliability of Visual Language Models (VLMs) under benign, semantics-preserving perturbations of both visual and textual inputs. It formalizes a stability framework and evaluates open-source and closed-source models across multiple benchmarks, finding pervasive instability under small changes such as pixel shifts, rephrasings, and multilingual translations. The authors show that stability strongly correlates with correctness, and that the stability patterns of smaller open-source models can be used to predict whether larger closed-source models answer correctly. They further analyze internal representations and show that stability patterns persist across models, which points to a fundamental fragility in current VLMs and motivates robustness evaluations centered on invariances rather than only on adversarial perturbations.

Abstract

Visual Language Models (VLMs) have achieved remarkable progress, yet their reliability under small, meaning-preserving input changes remains poorly understood. We present the first large-scale, systematic study of VLM robustness to benign visual and textual perturbations: pixel-level shifts, light geometric transformations, padded rescaling, paraphrasing, and multilingual rewrites that do not alter the underlying semantics of an image-question pair. Across a broad set of models and datasets, we find that modern VLMs are highly sensitive to such minor perturbations: a substantial fraction of samples change their predicted answer under at least one visual or textual modification. We characterize how this instability varies across perturbation types, question categories, and models, revealing that even state-of-the-art systems (e.g., GPT-4o, Gemini 2.0 Flash) frequently fail under shifts as small as a few pixels or harmless rephrasings. We further show that sample-level stability serves as a strong indicator of correctness: stable samples are consistently far more likely to be answered correctly. Leveraging this, we demonstrate that the stability patterns of small, accessible open-source models can be used to predict the correctness of much larger closed-source models with high precision. Our findings expose a fundamental fragility in current VLMs and highlight the need for robustness evaluations that go beyond adversarial perturbations, focusing instead on invariances that models should reliably uphold.
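To make the notion of sample-level stability concrete, the following is a minimal sketch, not the authors' implementation. It assumes a sample counts as stable when every perturbed variant of the image-question pair yields the same (case- and whitespace-normalized) answer as the unperturbed input; the paper's formal definition may differ. The function names, the dictionary keys, and the toy answers are all hypothetical and serve only to illustrate how accuracy could be compared between stable and unstable samples.

```python
from collections import defaultdict


def normalize(answer):
    # Assumed normalization: ignore case and surrounding whitespace.
    return answer.strip().lower()


def is_stable(original_answer, perturbed_answers):
    # Assumed definition: stable if no perturbation changes the answer.
    target = normalize(original_answer)
    return all(normalize(a) == target for a in perturbed_answers)


def accuracy_by_stability(samples):
    # samples: iterable of dicts with hypothetical keys
    #   'truth'     - ground-truth answer
    #   'original'  - model answer on the unperturbed input
    #   'perturbed' - model answers on the perturbed variants
    buckets = defaultdict(list)
    for s in samples:
        stable = is_stable(s["original"], s["perturbed"])
        correct = normalize(s["original"]) == normalize(s["truth"])
        buckets[stable].append(correct)
    # Map True (stable) / False (unstable) to mean accuracy in that group.
    return {k: sum(v) / len(v) for k, v in buckets.items()}


# Toy, made-up answers for illustration only (not results from the paper).
samples = [
    {"truth": "red", "original": "red", "perturbed": ["red", "Red", "red"]},
    {"truth": "two", "original": "two", "perturbed": ["two", "two"]},
    {"truth": "cat", "original": "dog", "perturbed": ["cat", "dog"]},
    {"truth": "yes", "original": "yes", "perturbed": ["no", "yes"]},
]
result = accuracy_by_stability(samples)
print(result)
```

On real data, the gap between the two groups' accuracies would be one simple way to check the claim that stable samples are more likely to be answered correctly.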
