Same Answer, Different Representations: Hidden instability in VLMs
Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini
TL;DR
The paper demonstrates that output-level robustness in Vision Language Models (VLMs) hides substantial internal representation drift under meaning-preserving perturbations. It introduces a representation-aware and frequency-aware framework that jointly analyzes embedding stability, spectral drift, and local token structure (Dirichlet energy) across SEEDBench, MMMU, and POPE. Empirically, larger scale does not guarantee improved robustness; perturbations can move internal representations substantially while leaving predictions unchanged, and effects differ by task (reasoning vs. hallucination) and perturbation type. By treating perturbations as cross-frequency spectral drift rather than pixel-level noise, the work provides a unified view of multimodal instability and calls for robustness evaluations that go beyond label stability to address underlying representation coherence and decision margins.
Abstract
The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on hallucination benchmarks they can reduce false positives by making models generate more conservative answers.
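To make two of the measured quantities concrete, the following is a minimal sketch, not the paper's implementation. It assumes pooled image embeddings are available as NumPy arrays, uses cosine distance as the drift measure, and computes Dirichlet energy over a 4-neighbour grid of vision tokens. The function names (`cosine_distance`, `drift_ratio`, `dirichlet_energy`), the choice of distance, and the normalisation by edge count are all illustrative assumptions; the summary above does not specify these details.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def drift_ratio(clean, perturbed):
    """Mean clean-vs-perturbed drift divided by mean inter-image distance.

    clean, perturbed: arrays of shape (n_images, dim); row i of `perturbed`
    is assumed to be the embedding of the perturbed version of image i.
    """
    clean = np.asarray(clean, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    n = clean.shape[0]
    # Drift: how far each image moves under the perturbation.
    drift = np.mean([cosine_distance(clean[i], perturbed[i]) for i in range(n)])
    # Baseline: how far apart distinct clean images are from each other.
    inter = np.mean([
        cosine_distance(clean[i], clean[j])
        for i in range(n) for j in range(i + 1, n)
    ])
    return float(drift / inter)


def dirichlet_energy(tokens):
    """Mean squared difference between 4-neighbour tokens on an (H, W, D) grid.

    Lower values mean spatially smoother token embeddings.
    """
    tokens = np.asarray(tokens, dtype=float)
    dv = tokens[1:, :, :] - tokens[:-1, :, :]   # vertical neighbour differences
    dh = tokens[:, 1:, :] - tokens[:, :-1, :]   # horizontal neighbour differences
    n_edges = dv.shape[0] * dv.shape[1] + dh.shape[0] * dh.shape[1]
    return float((np.sum(dv ** 2) + np.sum(dh ** 2)) / n_edges)
```

Under these assumptions, a `drift_ratio` close to 1 would correspond to the situation described in the abstract, where perturbation-induced drift approaches the magnitude of inter-image variability even though the predicted answer is unchanged.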
