Mechanistically Interpreting Compression in Vision-Language Models

Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul

Abstract

Compressed vision-language models (VLMs) are widely used to reduce memory and compute costs, making them a suitable choice for real-world deployment. However, compressing these models raises concerns about whether internal computations and safety behaviors are preserved. In this work, we use causal circuit analysis and crosscoder-based feature comparisons to examine how pruning and quantization fundamentally change the internals across representative VLMs. We observe that pruning generally keeps circuit structure intact but rotates and attenuates internal features, while quantization rewires circuits more extensively yet leaves the surviving features better aligned. Leveraging this insight, we also introduce VLMSafe-420, a novel benchmark that pairs harmful inputs with matched benign counterfactuals across various safety categories. Our findings show that pruning causes a sharp drop in genuine refusal behavior, suggesting that the choice of compression method has safety implications.

Paper Structure

This paper contains 55 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Edge activation patching on BLIP-VQA for Visual-Counterfact (green indicates higher importance). Wanda (50%) mostly retains the original circuits (attention heads and MLPs) and component-wise importances. Conversely, INT4 quantization heavily modifies these pathways and relies on new mechanisms.
  • Figure 2: Crosscoder class distribution on Visual-Counterfact for BLIP-VQA and LLaVA, using $\text{TopK} = 200$. Features are classified as uncompressed-only, shared-aligned, intermediate, redirected, attenuated, or compressed-only based on $\rho_i$ (decoder norm ratio) and $\theta_i$ (cosine similarity). The distribution quantitatively substantiates how much each compression method preserves, modifies, or replaces the original feature structure.
  • Figure 3: Refusal token probabilities from logit-lens analysis.
  • Figure 4: Crosscoder class distributions on Visual-Counterfact across BLIP-VQA (single module compression) and Qwen3-VL-2B (combined compression).
  • Figure 5: Typographic attack pair in VLMSafe-420. Content warning: this image includes harmful text that the authors do not endorse; it is used solely for research and evaluation purposes.
  • ...and 2 more figures
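
The crosscoder classification described in Figure 2 can be sketched as a simple rule over $\rho_i$ (decoder norm ratio between compressed and uncompressed models) and $\theta_i$ (cosine similarity of the two decoder directions). This is a minimal illustration assuming specific threshold values; the class names follow the paper's taxonomy, but the thresholds and helper names here are hypothetical, not the paper's.

```python
import math

def rho_theta(d_uncompressed, d_compressed):
    """Compute the decoder norm ratio rho and cosine similarity theta
    between a feature's decoder vectors in the two models."""
    n_u = math.sqrt(sum(x * x for x in d_uncompressed))
    n_c = math.sqrt(sum(x * x for x in d_compressed))
    dot = sum(a * b for a, b in zip(d_uncompressed, d_compressed))
    rho = n_c / n_u
    theta = dot / (n_u * n_c)
    return rho, theta

def classify_feature(rho, theta,
                     rho_dead=0.1,   # assumed: below this, feature exists only pre-compression
                     rho_only=10.0,  # assumed: above this, feature exists only post-compression
                     rho_atten=0.5,  # assumed: aligned but noticeably weakened
                     cos_hi=0.9, cos_lo=0.5):
    """Assign one of the six Figure 2 classes from (rho, theta).
    All threshold values are illustrative assumptions."""
    if rho < rho_dead:
        return "uncompressed-only"
    if rho > rho_only:
        return "compressed-only"
    if theta >= cos_hi:
        # direction preserved: distinguish strength-preserving vs weakened
        return "shared-aligned" if rho >= rho_atten else "attenuated"
    if theta < cos_lo:
        return "redirected"  # feature rotated to a new direction
    return "intermediate"
```

Under this sketch, pruning's signature (rotated and attenuated features) would show up as mass in the "redirected" and "attenuated" classes, while quantization's better-aligned survivors would concentrate in "shared-aligned".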