Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images
Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu
TL;DR
This work investigates an availability vulnerability in large vision-language models (VLMs): imperceptible image perturbations can force a model to generate much longer output sequences, which raises energy consumption and latency during inference. The authors introduce verbose images, crafted with three loss objectives (delaying the end-of-sequence token, increasing per-token uncertainty, and promoting diversity among the tokens of the generated sequence) and a temporal weight adjustment that balances them, to maximize sequence length under an $\ell_p$ perturbation budget. In experiments on four open-source VLMs, verbose images increase generated sequence length by up to approximately $7.87\times$ on MS-COCO and $8.56\times$ on ImageNet, with corresponding rises in energy and latency; the generated captions also show dispersed attention and more object hallucination. The study highlights practical implications for deployment and motivates limits on generation length and robust defenses in multi-modal inference systems.
Abstract
Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, deploying VLMs requires substantial energy and computational resources. If attackers maliciously induce high energy consumption and latency (energy-latency cost) during VLM inference, they can exhaust those resources. In this paper, we explore this attack surface on the availability of VLMs and aim to induce high energy-latency cost during inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, which carry an imperceptible perturbation crafted to induce VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, we propose a loss that delays the occurrence of the end-of-sequence (EOS) token, which signals the VLM to stop generating further tokens. Second, we propose an uncertainty loss and a token diversity loss, which increase the uncertainty over each generated token and the diversity among all tokens of the generated sequence, respectively; these break output dependency at the token level and the sequence level. Finally, we propose a temporal weight adjustment algorithm that effectively balances these losses. Extensive experiments demonstrate that our verbose images increase the length of generated sequences by 7.87 times and 8.56 times over the original images on the MS-COCO and ImageNet datasets, respectively, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
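To make the three objectives concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not give the exact formulas, so each definition here is an assumption: the EOS loss is taken as the mean EOS probability over generated positions, the uncertainty loss as the negative mean entropy, and the token diversity loss as the negative nuclear norm of a hypothetical hidden-state matrix. The function names, the fixed weights in `total_loss`, and the $\ell_\infty$ projection are likewise illustrative; the paper's temporal weight adjustment is not reproduced.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(probs, eos_id):
    # Assumed form: mean probability of EOS across generated positions.
    # Minimizing it pushes the model away from stopping early.
    return probs[:, eos_id].mean()

def uncertainty_loss(probs, eps=1e-12):
    # Assumed form: negative mean entropy of the per-token distributions.
    # Minimizing it raises the uncertainty over each generated token.
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    return -entropy.mean()

def token_diversity_loss(hidden):
    # Assumed form: negative nuclear norm of the (tokens x features) matrix.
    # A larger nuclear norm means the token representations are less collinear.
    return -np.linalg.norm(hidden, ord="nuc")

def total_loss(probs, hidden, eos_id, weights=(1.0, 1.0, 1.0)):
    # Fixed hypothetical weights stand in for the temporal weight adjustment.
    w1, w2, w3 = weights
    return (w1 * eos_delay_loss(probs, eos_id)
            + w2 * uncertainty_loss(probs)
            + w3 * token_diversity_loss(hidden))

def project_linf(delta, epsilon):
    # Keep the perturbation inside an l_inf ball of radius epsilon
    # (one common choice of l_p budget, assumed here).
    return np.clip(delta, -epsilon, epsilon)
```

In an attack loop, one would compute `total_loss` from the model's output distributions and hidden states for the perturbed image, take a gradient step on the perturbation with an autodiff framework, and then apply `project_linf` to stay within the budget.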
