EigenShield: Causal Subspace Filtering via Random Matrix Theory for Adversarially Robust Vision-Language Models

Nastaran Darabi, Devashri Naik, Sina Tayebati, Dinithi Jayasuriya, Ranganath Krishnan, Amit Ranjan Trivedi

TL;DR

EigenShield presents an inference-time defense for Vision-Language Models based on Random Matrix Theory, leveraging the spiked covariance model to separate causal, semantically meaningful eigen-directions from correlational noise. The method uses a Robustness-based Nonconformity Score to threshold outlier eigenvectors and projects embeddings onto the resulting causal subspace, filtering adversarial artifacts without retraining. Empirical results across multiple VLMs and threat types show substantial reductions in attack success rate (ASR) and harmful content, consistently outperforming adversarial training, UNIGUARD, and CIDER. This work demonstrates spectral analysis as a principled, architecture-agnostic approach to enhancing multimodal robustness with practical inference-time efficiency.

Abstract

Vision-Language Models (VLMs) inherit adversarial vulnerabilities of Large Language Models (LLMs), which are further exacerbated by their multimodal nature. Existing defenses, including adversarial training, input transformations, and heuristic detection, are computationally expensive, architecture-dependent, and fragile against adaptive attacks. We introduce EigenShield, an inference-time defense leveraging Random Matrix Theory to quantify adversarial disruptions in high-dimensional VLM representations. Unlike prior methods that rely on empirical heuristics, EigenShield employs the spiked covariance model to detect structured spectral deviations. Using a Robustness-based Nonconformity Score (RbNS) and quantile-based thresholding, it separates causal eigenvectors, which encode semantic information, from correlational eigenvectors that are susceptible to adversarial artifacts. By projecting embeddings onto the causal subspace, EigenShield filters adversarial noise without modifying model parameters or requiring adversarial training. This architecture-independent, attack-agnostic approach significantly reduces the attack success rate, establishing spectral analysis as a principled alternative to conventional defenses. Our results demonstrate that EigenShield consistently outperforms all existing defenses, including adversarial training, UNIGUARD, and CIDER.
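
The filtering step described in the abstract (score the eigenvectors, threshold by quantile, project onto the retained subspace) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the RbNS is not defined in this summary, so the default score is simply the eigenvalue itself, and the names `causal_projector`, `filter_embedding`, `keep_fraction`, and `score_fn` are hypothetical.

```python
import numpy as np

def causal_projector(embeddings, keep_fraction=0.1, score_fn=None):
    """Build an orthogonal projector onto the top-scoring eigen-directions.

    embeddings:    (n, p) array of reference embeddings.
    keep_fraction: hypothetical knob, fraction of eigenvectors to retain.
    score_fn:      hypothetical hook standing in for RbNS; it maps
                   (eigvals, eigvecs) to one score per eigenvector.
                   Default (an assumption): the eigenvalue itself.
    """
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / embeddings.shape[0]
    # eigh returns eigenvalues in ascending order; eigenvectors are columns.
    eigvals, eigvecs = np.linalg.eigh(cov)
    scores = eigvals if score_fn is None else score_fn(eigvals, eigvecs)
    # Quantile-based threshold: directions scoring at or above tau are kept.
    tau = np.quantile(scores, 1.0 - keep_fraction)
    causal = eigvecs[:, scores >= tau]
    return causal @ causal.T, tau

def filter_embedding(x, projector):
    """Project one embedding (p,) or a batch (m, p) onto the kept subspace."""
    return x @ projector

rng = np.random.default_rng(0)
reference = rng.standard_normal((500, 20))
projector, tau = causal_projector(reference, keep_fraction=0.25)
filtered = filter_embedding(reference[:3], projector)
```

Because the projector is symmetric and idempotent, applying the filter twice gives the same result as applying it once, and no model parameters are touched.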

Paper Structure

This paper contains 23 sections, 2 theorems, 36 equations, 5 figures, 6 tables.

Key Result

Theorem 1

Suppose $\{\mathbf{x}_i\}_{i=1}^n$ are i.i.d. with mean $\mathbf{0}$ and covariance $\boldsymbol{\Sigma}$ as above, and let $c = p/n > 0$ be fixed as $p,n \to \infty$. Define where $\lambda'_j$ is an eigenvalue of $\boldsymbol{\Sigma}_{\mathrm{signal}}$. Then:

Figures (5)

  • Figure 1: Overview of EigenShield: Jailbreak attacks attempt to produce harmful responses from VLMs by manipulating inputs. EigenShield intervenes by filtering input embeddings via a causal subspace. EigenShield's defense mechanisms are architecture-independent, requiring no retraining, theoretically grounded in Random Matrix Theory with guarantees via eigenvalue distributions, attack-agnostic, and computationally efficient.
  • Figure 2: EigenShield's effect on adversarial representations. (a) Eigenvalue spectrum comparison shows restoration of adversarial images to clean-like distributions. (b) Heatmap visualization depicts eigenvalue variability across 100 images. (c) Eigenvector classification via RbNS separates robust causal (blue) from spurious components (red) using threshold $\tau^*$.
  • Figure 3: Attack Success Rate (%) vs. coverage parameter $\gamma$ for two VLMs: LLaVA-v1.5 (blue) and MiniGPT-4 (red). Higher values of $\gamma$ generally correspond to stricter causal inclusion, leading to a reduction in attack success for both models. The top axis indicates the number of trainable parameters (in thousands) for each $\gamma$. As $\gamma$ increases, more eigenvalue directions are retained, effectively enhancing the model’s capacity.
  • Figure 4: Attack Success Rates (%) across five adversarial attacks on (a) LLaVA-v1.5-7B [liu2024visual] and (b) MiniGPT-4 [zhu2023minigpt]. Radial axes represent attack methods, with larger radii indicating higher vulnerability. EigenShield consistently achieves the lowest attack success rates, while other defenses vary in effectiveness.
  • Figure 5: Example prompts and corresponding model responses before and after applying EigenShield. The images represent various sensitive topics, including religion, sexuality, and social issues. EigenShield effectively mitigates adversarial vulnerabilities by filtering out harmful or policy-violating responses while preserving semantic coherence.

Theorems & Definitions (2)

  • Theorem 1: Spiked Covariance Model, adapted from [paul2007asymptotics]
  • Lemma 1: Asymptotic Location of Spiked Sample Eigenvalues
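
Lemma 1 concerns where spiked sample eigenvalues land asymptotically. Its statement is not reproduced in this summary, so the sketch below assumes the standard single-spike result from the spiked covariance literature with unit noise variance: with $c = p/n$, a population spike $\ell > 1 + \sqrt{c}$ produces a top sample eigenvalue near $\ell\,(1 + c/(\ell - 1))$, while a weaker spike is absorbed at the bulk edge $(1 + \sqrt{c})^2$. The paper's own normalization may differ, and the function names are illustrative.

```python
import numpy as np

def mp_upper_edge(c, sigma2=1.0):
    """Upper edge of the Marchenko-Pastur bulk for aspect ratio c = p/n."""
    return sigma2 * (1.0 + np.sqrt(c)) ** 2

def predicted_top_eigenvalue(spike, c):
    """Assumed limit of the top sample eigenvalue for one population spike
    (unit noise variance)."""
    if spike > 1.0 + np.sqrt(c):
        # Supercritical: the sample eigenvalue separates from the bulk.
        return spike * (1.0 + c / (spike - 1.0))
    # Subcritical: the spike is indistinguishable from the bulk edge.
    return mp_upper_edge(c)

def simulate_top_eigenvalue(spike, p, n, seed=0):
    """Top eigenvalue of a sample covariance with population
    covariance diag(spike, 1, ..., 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, p))
    x[:, 0] *= np.sqrt(spike)  # inflate the variance of one direction
    sample_cov = x.T @ x / n
    return np.linalg.eigvalsh(sample_cov)[-1]

p, n = 200, 800            # c = 0.25, detection threshold 1 + sqrt(c) = 1.5
c = p / n
for spike in (4.0, 1.2):   # one supercritical spike, one subcritical
    sim = simulate_top_eigenvalue(spike, p, n)
    print(f"spike={spike}: simulated {sim:.3f}, "
          f"predicted {predicted_top_eigenvalue(spike, c):.3f}")
```

Only the supercritical spike yields an outlier eigenvalue, which is the property a spectral filter relies on to tell structured directions apart from bulk noise.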