The Security Threat of Compressed Projectors in Large Vision-Language Models
Yudong Zhang, Ruobing Xie, Xingwu Sun, Jiansheng Chen, Zhanhui Kang, Di Wang, Yu Wang
TL;DR
This paper examines security vulnerabilities of visual language projectors (VLPs) within LVLMs by comparing compressed (e.g., Q-former) and uncompressed (e.g., MLP) projectors. It introduces an attack framework covering white-box and gray-box settings, with a loss $L_{TCP}$ that combines $L_{VE}$ and $L_{VLP}$ to probe VLP security. Across multiple datasets and surrogate VLPs, compressed projectors prove significantly more vulnerable to adversarial manipulation, whereas uncompressed projectors remain robust, largely independent of the number of visual tokens. The results guide practitioners toward safer VLP choices and suggest token-pooling or hybrid designs to balance efficiency and security. They also have practical implications for deploying LVLMs in high-security contexts and for designing defense-oriented VLP architectures.
Abstract
The choice of a suitable visual language projector (VLP) is critical to the successful training of large visual language models (LVLMs). Mainstream VLPs can be broadly categorized into compressed and uncompressed projectors, and each offers distinct advantages in performance and computational efficiency. However, their security implications have not been thoroughly examined. Our comprehensive evaluation reveals significant differences in their security profiles: compressed projectors exhibit substantial vulnerabilities, allowing adversaries to successfully compromise LVLMs even with minimal knowledge of the model's structure. In stark contrast, uncompressed projectors demonstrate robust security properties and do not introduce additional vulnerabilities. These findings provide critical guidance for researchers in selecting optimal VLPs that enhance the security and reliability of visual language models. The code is available at https://github.com/btzyd/TCP.
