The Security Threat of Compressed Projectors in Large Vision-Language Models

Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang

TL;DR

This paper examines security vulnerabilities of visual language projectors (VLPs) in LVLMs by comparing compressed (e.g., Q-former) and uncompressed (e.g., MLP) projectors. It introduces an attack framework covering white-box and gray-box settings, with a loss $\mathcal{L}_\text{TCP}$ that combines $\mathcal{L}_\text{VE}$ and $\mathcal{L}_\text{VLP}$ to probe VLP security. Across multiple datasets and surrogate VLPs, the study finds that compressed projectors are significantly more vulnerable to adversarial manipulation, whereas uncompressed projectors remain robust, largely independent of the number of visual tokens. The results guide practitioners toward safer VLP choices and suggest token-pooling or hybrid designs to balance efficiency and security, with practical implications for deploying LVLMs in high-security contexts and for designing defense-oriented VLP architectures.

Abstract

The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offers distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of the model's structure. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code is available at https://github.com/btzyd/TCP.

Paper Structure

This paper contains 21 sections, 3 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: The attack pipeline operates by first extracting surrogate VLPs $\{f_\text{VLP}^1,\dots,f_\text{VLP}^N\}$, which are then utilized to generate adversarial examples $x'_i$ through the loss functions $\mathcal{L}_\text{VE}$, $\mathcal{L}_\text{VLP}$ and $\mathcal{L}_\text{TCP}(\beta, K)$ (TCP stands for "Threat of Compressed Projectors"). A key aspect of this investigation is determining whether incorporating attacks on VLPs increases security vulnerabilities compared to solely attacking VEs, thereby assessing the robustness of the VLP structure.
  • Figure 2: Some adversarial examples of attacks on InstructBLIP-Vicuna-7B.
  • Figure 3: Some adversarial examples of attacks on InstructBLIP-Vicuna-7B.
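
The pipeline described in the Figure 1 caption can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Everything below is an assumption made for illustration: the visual encoder and the surrogate projectors are stand-in linear maps, the losses are squared feature distances between clean and adversarial inputs, $\beta$ is treated as a weight on the projector term, and $K$ as the number of surrogate VLPs used. The summary above does not define how $\beta$ and $K$ actually enter $\mathcal{L}_\text{TCP}(\beta, K)$. The optimizer is a standard projected gradient (PGD-style) loop under an L-infinity budget.

```python
import random

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

def feature_loss_and_grad(W, x_adv, x_clean):
    """Squared feature distance ||W x_adv - W x_clean||^2 and its gradient in x_adv."""
    diff = [a - c for a, c in zip(matvec(W, x_adv), matvec(W, x_clean))]
    loss = sum(d * d for d in diff)
    grad = [2.0 * g for g in matvec(transpose(W), diff)]
    return loss, grad

def tcp_loss_and_grad(W_ve, surrogate_vlps, x_adv, x_clean, beta, K):
    """ASSUMED combination: L_VE + beta * mean of L_VLP over the first K surrogates."""
    loss, grad = feature_loss_and_grad(W_ve, x_adv, x_clean)
    used = surrogate_vlps[:K]
    for W_vlp in used:
        # The toy projector acts on encoder features, so the composed map is W_vlp @ W_ve.
        W_comp = [matvec(transpose(W_ve), row) for row in W_vlp]
        l_vlp, g_vlp = feature_loss_and_grad(W_comp, x_adv, x_clean)
        loss += beta * l_vlp / len(used)
        grad = [a + beta * b / len(used) for a, b in zip(grad, g_vlp)]
    return loss, grad

def pgd_attack(x_clean, loss_grad_fn, eps=0.1, step=0.02, iters=20, seed=0):
    """Maximize the loss within an L-infinity ball of radius eps around x_clean."""
    rng = random.Random(seed)
    # Random start: the gradient of a feature-distance loss is zero exactly at x_clean.
    x_adv = [c + rng.uniform(-eps, eps) for c in x_clean]
    for _ in range(iters):
        _, grad = loss_grad_fn(x_adv)
        # Signed gradient ascent step, then projection back onto the eps-ball.
        x_adv = [a + step * ((g > 0) - (g < 0)) for a, g in zip(x_adv, grad)]
        x_adv = [min(max(a, c - eps), c + eps) for a, c in zip(x_adv, x_clean)]
    return x_adv

# Hypothetical toy dimensions and random weights standing in for real models.
rng = random.Random(1)
d_in, d_ve, d_vlp = 6, 4, 2
W_ve = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_ve)]
surrogates = [[[rng.gauss(0, 1) for _ in range(d_ve)] for _ in range(d_vlp)]
              for _ in range(3)]
x = [rng.random() for _ in range(d_in)]

eps = 0.1
x_adv = pgd_attack(
    x, lambda z: tcp_loss_and_grad(W_ve, surrogates, z, x, beta=1.0, K=2), eps=eps)
loss_ve_only, _ = tcp_loss_and_grad(W_ve, surrogates, x_adv, x, beta=0.0, K=0)
loss_tcp, _ = tcp_loss_and_grad(W_ve, surrogates, x_adv, x, beta=1.0, K=2)
```

Setting `beta=0.0` (or `K=0`) reduces the objective to the encoder-only term, which mirrors the comparison the caption highlights: whether adding the projector term to an encoder-only attack increases the damage. A real experiment would replace the linear maps with the actual visual encoder and surrogate projectors and would also clip `x_adv` to the valid pixel range.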