
Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Eyal Hadad, Mordechai Guri

Abstract

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker exploits significant execution-time variations, observed through standard OS metrics, to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker resolves semantic ambiguity within identical geometries, distinguishing visually dense content (e.g., medical X-rays) from sparse content (e.g., text documents). Evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, revealing the substantial performance overhead of constant-work padding, and propose practical design recommendations for secure Edge AI deployments.


Paper Structure

This paper contains 54 sections, 1 equation, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Dual-layer threat model overview. A co-located, unprivileged process observes shared hardware behavior while a local VLM processes an image with dynamic resolution. The input-dependent workload creates two distinct side-channels: Tier 1 leverages OS-level timing metrics from CPU cores to infer the input's geometric grid (aspect ratio), while Tier 2 exploits hardware performance counters (HPCs) monitoring the LLC to infer the input's visual semantic density.
  • Figure 2: AnyRes dynamic preprocessing. The model tiles the image into an $(m\times n)$ grid based on aspect ratio and adds a global view, yielding $(m\times n)+1$ patches. Different grids (e.g., $1\times2$ vs. $2\times2$) produce different amounts of work, creating a timing signal.
  • Figure 3: Deterministic Geometric Leakage. Distribution of inference time across aspect ratios in (a) Cold Cache and (b) Warm Cache states. Note the distinct "pyramid" structure: Square inputs (1:1) incur a $\sim 2.3\times$ latency penalty compared to rectangular inputs (1:2, 2:1), creating non-overlapping clusters that enable 100% classification accuracy regardless of cache state.
  • Figure 4: Joint timing and cache leakage. Projection of the Combined Attack vector into a two-dimensional space. The X-axis (execution time) separates inputs by geometry clusters (portrait vs. square) due to varying patch counts. The Y-axis (LLC misses) resolves the semantic ambiguity within each geometric cluster (e.g., separating dense X-ray images from sparse documents).
  • Figure 5: Confusion Matrix of the Combined Attack. The results visually validate the dual-layer threat model: deterministic timing differences eliminate cross-geometry errors (empty off-diagonal quadrants), while cache-miss telemetry resolves semantic ambiguity within identical geometries. Notably, the attack achieves perfect or near-perfect classification for privacy-critical targets, such as encrypted data and X-ray images.
  • ...and 3 more figures
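The geometric leakage described above follows directly from the AnyRes-style tiling in Figure 2: the chosen $(m\times n)$ grid, and hence the patch count and workload, is a deterministic function of the input's aspect ratio. The sketch below illustrates this dependence; it is a simplified approximation, not the actual LLaVA-NeXT preprocessing code, and the grid-selection heuristic and parameters (`patch`, `max_tiles`) are assumptions for illustration.

```python
def select_grid(width, height, patch=336, max_tiles=4):
    """Pick an (m rows x n cols) tiling grid whose aspect ratio (n/m)
    best matches the input image's aspect ratio (width/height)."""
    target = width / height
    best, best_err = (1, 1), float("inf")
    for m in range(1, max_tiles + 1):          # candidate row counts
        for n in range(1, max_tiles + 1):      # candidate column counts
            if m * n > max_tiles:
                continue
            err = abs(n / m - target)          # aspect-ratio mismatch
            # Prefer the closest aspect ratio; on a tie, prefer the grid
            # with more tiles, provided it does not exceed the native size.
            if err < best_err or (err == best_err and m * n > best[0] * best[1]
                                  and n * patch <= width and m * patch <= height):
                best, best_err = (m, n), err
    return best

def patch_count(width, height):
    """Total vision-encoder inputs: local tiles plus one global view."""
    m, n = select_grid(width, height)
    return m * n + 1

# A square input yields more tiles, and thus more work, than a rectangle:
print(patch_count(672, 672))   # 2x2 grid -> 5 patches
print(patch_count(672, 336))   # 1x2 grid -> 3 patches
```

Because the patch count fixes the number of vision-encoder forward passes, an observer who recovers the workload size also recovers the input's coarse geometry, which is precisely the Tier 1 signal.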