Edge Reliability Gap in Vision-Language Models: Quantifying Failure Modes of Compressed VLMs Under Visual Corruption

Mehmet Kaan Erol

Abstract

The rapid compression of large vision-language models (VLMs) for edge deployment raises an underexplored question: do compact models fail differently, not merely more often? This study compares a 7-billion-parameter quantised VLM (Qwen2.5-VL-7B, 4-bit NF4) against a 500-million-parameter FP16 model (SmolVLM2-500M) across 4,000 samples from VQAv2 and COCO Captions. A three-category error taxonomy (Object Blindness, Semantic Drift, Prior Bias) is applied as a diagnostic framework. A text-only GPT-4o judge reveals Semantic Drift (B) as the dominant failure mode on VQAv2 and on COCO for Qwen, with a mixed Object Blindness / Semantic Drift profile for SmolVLM2 on COCO; Prior Bias (C) is present on VQAv2 but absent on COCO for both models. Confidence calibration is measured via Expected Calibration Error (ECE) using geometric mean token probability, compositional reasoning is probed with structured negation probes across four templates, and a blur robustness experiment completes the evaluation. For this model pair, the compact model exhibits a qualitatively distinct failure signature: a 12.5pp larger negation collapse (-33.2pp vs. -20.8pp, Wald 95% CI [8.2, 16.8]pp, p < 10^-8), driven almost entirely by COCO while the VQAv2 gap is not statistically significant (4.5pp, p=0.19). The most discriminating template is false_yn: SmolVLM2-500M responds "Yes" (incorrectly claiming a depicted object is absent) on 100% of COCO trials vs. 14% for Qwen2.5-VL-7B. Asymmetric dataset-dependent miscalibration and a blur experiment with two controlled ablations complete the analysis. The fully reproducible pipeline is released for systematic safety auditing of compressed VLMs prior to edge deployment.
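To make the calibration metric concrete, the following is a minimal Python sketch of ECE as described in the abstract: per-prediction confidence is the geometric mean of token probabilities, and ECE is the bin-weighted gap between average confidence and accuracy. Function names and the equal-width 10-bin scheme are illustrative assumptions, not the released pipeline's exact implementation.

```python
import math

def geometric_mean_confidence(token_probs):
    """Sequence-level confidence: geometric mean of per-token probabilities.

    Illustrative helper; assumes token_probs are the probabilities the model
    assigned to each generated token (all strictly positive).
    """
    logs = [math.log(p) for p in token_probs]
    return math.exp(sum(logs) / len(logs))

def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: per-prediction confidence scores in [0, 1]
    correct:     per-prediction booleans (prediction matched ground truth)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp conf == 1.0 into the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        # Weight each bin's |confidence - accuracy| gap by its sample share.
        total += (len(b) / n) * abs(avg_conf - accuracy)
    return total
```

A degenerate confidence function like the one Figure 3 reports for Qwen2.5-VL-7B on VQAv2 (all mass at confidence ≈0.999, accuracy ≈0.556) would land in a single bin and yield an ECE near |0.999 − 0.556|.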


Paper Structure

This paper contains 32 sections, 3 equations, 5 figures, 15 tables.

Figures (5)

  • Figure 1: GPT-4o text-only error-taxonomy distribution across VQAv2 and COCO Captions ($n{=}200$ failures per dataset; SmolVLM2 and Qwen NF4 averaged). VQAv2: Semantic Drift (B) dominates (39.0% combined), followed by Spatial Error (D, 21.5%); Object Blindness (A) is 17.0% and Prior Bias (C) is 7.0%. COCO: Semantic Drift (B) dominates (62.0% combined), with Object Blindness (A) at 28.5%; Prior Bias (C) is 0% for both models.
  • Figure 2: Qualitative failure grid: three representative errors from llm_judge_labels.json. Columns show the input image, ground truth, Qwen2.5-VL-7B output (green), and SmolVLM2-500M output (red). Rows 1--2: Object Blindness on VQAv2 and COCO. Row 3: Spatial Error (VQAv2).
  • Figure 3: Reliability diagrams for both models on VQAv2 (left) and COCO Captions (right). The dashed diagonal represents perfect calibration. Qwen2.5-VL-7B (blue) on VQAv2 collapses its entire 2,000-prediction distribution to a single point at confidence $\approx$0.999, accuracy $\approx$0.556---annotated and isolated far to the right of the diagonal. This is the defining visual of confidence-function degeneration: the model issues maximum confidence unconditionally, regardless of correctness. SmolVLM2-500M (red, dashed) distributes predictions across multiple bins: below the diagonal on VQAv2 (overconfident) and above it on COCO (underconfident).
  • Figure 4: Accuracy under clean ($\sigma{=}0$) and Gaussian-blurred ($\sigma{=}2$) conditions for both models ($n{=}100$ both-correct images). Both conditions are directly measured; no extrapolation is performed. Both models drop $3.0$ pp, giving $\rho = 1.00$ (95% CI $[0.00, 5.00]$; McNemar $p = 0.683$; differential degradation not significant).
  • Figure 5: Negation probe success rates per template across VQAv2 (left) and COCO Captions (right). Qwen2.5-VL-7B (blue) and SmolVLM2-500M (red) are shown side-by-side. The false_yn template is the most discriminating: SmolVLM2-500M collapses to 2% on VQAv2 and 0% on COCO, while Qwen2.5-VL-7B scores 49% and 86%.
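The negation-gap statistic quoted in the abstract (a 12.5pp difference with a Wald 95% CI) has the standard form for a difference of two independent proportions. The sketch below shows that form with hypothetical counts; the paper's underlying trial counts are not restated here, so the numbers are illustrative only.

```python
import math

def wald_ci_diff(k1, n1, k2, n2, z=1.96):
    """Wald confidence interval for a difference of two proportions p1 - p2.

    k1/n1 and k2/n2 are successes/trials for the two groups; z=1.96 gives
    a 95% interval. Returns (point_estimate, (lower, upper)).
    """
    p1, p2 = k1 / n1, k2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - z * se, diff + z * se)

# Hypothetical counts purely for illustration of the interval's shape:
diff, (lo, hi) = wald_ci_diff(40, 100, 20, 100)
```

With these made-up counts the point estimate is 0.20 with a CI of roughly (0.076, 0.324); the paper's reported [8.2, 16.8]pp interval would arise from its own (larger) probe counts.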