Same Answer, Different Representations: Hidden instability in VLMs

Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini

TL;DR

The paper demonstrates that output-level robustness in Vision Language Models hides substantial internal representation drift under meaning-preserving perturbations. It introduces a representation-aware and frequency-aware framework that jointly analyzes embedding stability, spectral drift, and local token structure (Dirichlet energy) across SEEDBench, MMMU, and POPE. Empirically, larger models do not guarantee improved robustness; perturbations can move internal representations substantially while leaving predictions unchanged, and effects differ by task (reasoning vs. hallucination) and perturbation type. By treating perturbations as cross-frequency spectral drift rather than pixel-level noise, the work provides a unified view of multimodal instability and calls for robustness evaluations that go beyond label stability to address underlying representation coherence and decision margins.

Abstract

The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.

Paper Structure (65 sections, 8 equations, 20 figures, 8 tables)

This paper contains 65 sections, 8 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: Cosine distance ($1-\cos$): drift versus control drift for the ans_mcq_free embedding under Translation and Textoverlay perturbations. Blue shows perturbation-induced drift relative to the base image; orange shows control drift (base image versus randomly sampled other images). Left: Translation. Right: Textoverlay. Unlike geometric perturbations, the Textoverlay perturbation-induced distribution does not remain well separated from control drift, indicating that the representation no longer stays local in embedding space.
  • Figure 2: Qwen3-VL (Instruct) scaling on SEEDBench. Left: base accuracy versus ground truth. Right: average flip rate under natural perturbations (lower is better).
  • Figure 3: Visual perturbation examples applied to a SEEDBench sample used in our robustness evaluation. The figure displays the original base image (Top-Left) alongside three geometric transformations: translation (Top-Right), padding and cropping (Bottom-Left), and $-30^{\circ}$ rotation (Bottom-Right). These perturbations serve to test the model's structural consistency under spatial variation without altering semantic content.
  • Figure 4: L2 distance: drift versus control drift for the ans_mcq_free embedding under Translation and Textoverlay perturbations. Blue shows perturbation-induced drift relative to the base image; orange shows control drift (base image versus randomly sampled other images). Left: Translation. Right: Textoverlay. Unlike geometric perturbations, the Textoverlay perturbation-induced distribution does not remain well separated from control drift, indicating that the representation no longer stays local in embedding space.
  • Figure 5: Correctness transition statistics under natural perturbations for different Qwen3-VL model scales on SEEDBench.
  • ...and 15 more figures
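
The drift-versus-control comparison described in the captions of Figures 1 and 4 can be illustrated with a minimal sketch. This is not the authors' code: the function names (`cosine_distance`, `l2_distance`, `drift_and_control`), the list-of-lists embedding format, and the way control partners are sampled are all assumptions made for illustration. It only shows the idea of comparing the distance between a base image and its perturbed version (drift) against the distance between the base image and a randomly chosen other image (control drift).

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def drift_and_control(base, perturbed, distance=cosine_distance, seed=0):
    """Return (drift, control) lists, one value per sample.

    base[i] and perturbed[i] are hypothetical embeddings of the same image
    before and after a perturbation. Control drift pairs base[i] with the
    base embedding of a different, randomly sampled image (an assumed
    sampling scheme). Requires at least two samples.
    """
    rng = random.Random(seed)
    n = len(base)
    drift = [distance(b, p) for b, p in zip(base, perturbed)]
    control = []
    for i in range(n):
        j = rng.choice([k for k in range(n) if k != i])
        control.append(distance(base[i], base[j]))
    return drift, control


# Toy embeddings (made up for illustration, not data from the paper).
base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
drift, control = drift_and_control(base, perturbed)
```

If the drift values stay well below the control values, the perturbed representation remains local to its base image; when the two distributions overlap, as the captions report for Textoverlay, the representation has moved about as far as it would for an unrelated image. Passing `distance=l2_distance` gives the L2 variant of the same comparison.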