Minimal neuron ablation triggers catastrophic collapse in the language core of Large Vision-Language Models

Cen Lu, Yung-Chen Tang, Andrea Cavallaro

TL;DR

The paper investigates the robustness of large vision-language models (LVLMs) to neuron-level failures and introduces Consistently Activated Neurons (CAN), a method that jointly uses activation magnitude and gradient sensitivity to identify critical neurons. Experiments show that ablating only a handful of language-model FFN neurons can cause catastrophic collapse: LLaVA needs as few as four or five neurons and InstructBLIP around 1200, which points to a bottleneck in the language backbone rather than in the vision modules. Collapse follows a two-stage pattern: initial expressive degradation, then sudden, complete collapse. These findings have safety implications for LVLM deployment and suggest that targeted defenses should focus on safeguarding the language-model FFN pathways. The work also provides a data-driven framework for locating vulnerability hotspots across LVLM architectures and object categories.

Abstract

Large Vision-Language Models (LVLMs) have shown impressive multimodal understanding capabilities, yet their robustness is poorly understood. In this paper, we investigate the structural vulnerabilities of LVLMs to identify critical neurons whose removal triggers catastrophic collapse. To this end, we propose CAN, a method that detects Consistently Activated Neurons and locates critical neurons by progressive masking. Experiments on LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b reveal that masking only a tiny portion of the language model's feed-forward networks (as few as four neurons in extreme cases) suffices to trigger catastrophic collapse. Notably, critical neurons are predominantly localized in the language model rather than in the vision components, and the down-projection layer is a particularly vulnerable structure. We also observe a consistent two-stage collapse pattern: initial expressive degradation followed by sudden, complete collapse. Our findings provide important insights for safety research in LVLMs.
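
The rank-then-mask idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the scoring rule (mean over samples of |activation| times |gradient|), the function names, and the `collapsed` callback are all assumptions made for illustration, since this summary does not give the exact CAN formula or collapse criterion.

```python
from typing import Callable, List, Optional, Sequence

def importance_scores(activations: Sequence[Sequence[float]],
                      gradients: Sequence[Sequence[float]]) -> List[float]:
    """Per-neuron score, averaged over samples (rows).

    ASSUMPTION: score = mean(|activation| * |gradient|). The paper only
    says that activation and gradient magnitudes are combined.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    scores = [0.0] * n_neurons
    for acts, grads in zip(activations, gradients):
        for j in range(n_neurons):
            scores[j] += abs(acts[j]) * abs(grads[j])
    return [s / n_samples for s in scores]

def rank_neurons(scores: Sequence[float]) -> List[int]:
    """Neuron indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def progressive_masking(ranking: Sequence[int],
                        collapsed: Callable[[Sequence[int]], bool],
                        max_k: int) -> Optional[List[int]]:
    """Mask the top-k neurons for k = 1, 2, ... until `collapsed` fires.

    `collapsed` is a hypothetical callback that runs the model with the
    given neurons zeroed and reports whether its output has collapsed.
    Returns the smallest masked prefix that triggers collapse, or None.
    """
    for k in range(1, min(max_k, len(ranking)) + 1):
        masked = list(ranking[:k])
        if collapsed(masked):
            return masked
    return None

# Toy data: 2 samples, 4 neurons (values are made up for illustration).
toy_activations = [[0.1, 2.0, 0.5, 3.0], [0.2, 1.5, 0.4, 2.5]]
toy_gradients = [[0.5, 1.0, 0.1, 0.2], [0.5, 1.0, 0.1, 0.4]]
toy_ranking = rank_neurons(importance_scores(toy_activations, toy_gradients))

# Toy stand-in for a real model check: "collapse" once neurons 1 and 3 are both masked.
toy_critical = progressive_masking(
    toy_ranking, lambda masked: {1, 3} <= set(masked), max_k=4)
```

In a real experiment the callback would zero the selected FFN neurons, generate text, and apply whatever collapse test the study uses; here it is a stub so the control flow can be followed in isolation.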

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Catastrophic collapse modes induced by masking critical neurons in the language model feed-forward network (FFN) of LLaVA-1.5-7b-hf. The original model generates correct outputs (green box), but masking only 4-5 neurons in the language model causes catastrophic failure (red boxes).
  • Figure 2: CAN first ranks neurons by importance combining activation and gradient magnitudes. Next, masking the top-$k$ critical neurons causes catastrophic collapse, with the generation of meaningless outputs regardless of the input.
  • Figure 3: Architecture of the Feed-Forward Network (FFN) module in the transformer layer for LLaVA-1.5-7b-hf and InstructBLIP-Vicuna-7b. Numbers in parentheses indicate tensor dimensions. The identification and masking of critical neurons in the blue component (gate_proj) and the orange component (down_proj) are discussed in Section \ref{sec:Q1} and Section \ref{sec:q3}, respectively.
  • Figure 4: Comparison of PPL and CLIP score changes during progressive masking of neurons in the FFN gate_proj, using car images. Stage 1 indicates expressive degradation and Stage 2 indicates complete collapse, as explained in Section \ref{sec:Q1}.
  • Figure 5: Comparison of PPL and CLIP score changes during masking of neurons in the FFN down_proj for the two models, using car images. Stage 1 indicates expressive degradation and Stage 2 indicates complete collapse.
  • ...and 1 more figure
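
Figures 4 and 5 track perplexity (PPL) as neurons are masked. As a reminder of what that metric measures, the sketch below computes the standard definition, the exponential of the mean negative log-likelihood per token. The function name and input format are illustrative assumptions; this summary does not describe the paper's exact evaluation protocol.

```python
import math
from typing import Sequence

def perplexity(token_log_probs: Sequence[float]) -> float:
    """Standard perplexity: exp of the mean negative log-likelihood.

    `token_log_probs` holds the natural-log probability the model assigned
    to each reference token (a hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has PPL 4
# (up to floating-point rounding).
uniform_ppl = perplexity([math.log(0.25)] * 4)
```

Under this definition, a sharp rise in PPL means the model is assigning much lower probability to the reference text.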