Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images

Kuofeng Gao, Yang Bai, Jindong Gu, Shu-Tao Xia, Philip Torr, Zhifeng Li, Wei Liu

TL;DR

This work investigates a security-relevant vulnerability in large vision-language models: imperceptible image perturbations can force a model to generate much longer output sequences, which raises energy consumption and latency during inference. The authors introduce verbose images, crafted with three loss objectives (delaying the end-of-sequence token, increasing per-token uncertainty, and promoting diversity across the tokens of the generated sequence) together with a temporal weight adjustment, to maximize sequence length under an $\ell_p$ perturbation budget. In experiments on four open-source VLMs, verbose images increase sequence length by roughly $7.87$× on MS-COCO and $8.56$× on ImageNet, with corresponding rises in energy and latency, and they produce more dispersed attention and more object hallucination in generated captions. The study highlights practical implications for deployment and motivates limits on generation length and robust defenses in multi-modal inference systems.

Abstract

Large vision-language models (VLMs) such as GPT-4 have achieved exceptional performance across various multi-modal tasks. However, the deployment of VLMs necessitates substantial energy consumption and computational resources. If attackers maliciously induce high energy consumption and latency time (energy-latency cost) during VLM inference, they can exhaust computational resources. In this paper, we explore this attack surface on the availability of VLMs and aim to induce high energy-latency cost during inference. We find that the energy-latency cost of VLM inference can be manipulated by maximizing the length of generated sequences. To this end, we propose verbose images, with the goal of crafting an imperceptible perturbation that induces VLMs to generate long sentences during inference. Concretely, we design three loss objectives. First, a loss is proposed to delay the occurrence of the end-of-sequence (EOS) token, which signals the VLM to stop generating further tokens. Moreover, an uncertainty loss and a token diversity loss are proposed to increase the uncertainty over each generated token and the diversity among all tokens of the whole generated sequence, respectively, which can break output dependency at the token level and the sequence level. Furthermore, a temporal weight adjustment algorithm is proposed to balance these losses effectively. Extensive experiments demonstrate that our verbose images can increase the length of generated sequences by 7.87 times and 8.56 times compared to original images on the MS-COCO and ImageNet datasets, which presents potential challenges for various applications. Our code is available at https://github.com/KuofengGao/Verbose_Images.
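
To make the first two objectives concrete, the following NumPy sketch shows one plausible way to score a sequence of per-step token distributions. The exact loss formulas are not given in this summary, so the forms here are assumptions: the EOS term is taken to be the mean EOS probability over decoding steps, and the uncertainty term is taken to be the negative mean entropy. The function names (`softmax`, `eos_delay_loss`, `uncertainty_loss`) are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over the vocabulary axis (numerically stable)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(probs, eos_id):
    """Assumed form: mean probability assigned to EOS across all steps.

    probs has shape (num_steps, vocab_size). Minimizing this value pushes
    the model away from emitting EOS, which delays termination.
    """
    return probs[:, eos_id].mean()

def uncertainty_loss(probs, eps=1e-12):
    """Assumed form: negative mean entropy of the per-step distributions.

    Minimizing this value raises the entropy, i.e. the uncertainty, of
    each generated token.
    """
    entropy = -(probs * np.log(probs + eps)).sum(axis=-1)
    return -entropy.mean()

# Toy example: 5 decoding steps over a 10-token vocabulary, EOS at index 2.
rng = np.random.default_rng(0)
step_probs = softmax(rng.standard_normal((5, 10)))
print("EOS loss:", eos_delay_loss(step_probs, eos_id=2))
print("Uncertainty loss:", uncertainty_loss(step_probs))
```

In an actual attack these quantities would be differentiated with respect to the image perturbation in an autodiff framework; the sketch only illustrates what each term measures.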

Paper Structure

This paper contains 35 sections, 1 theorem, 7 equations, 12 figures, 21 tables, 1 algorithm.

Key Result

Proposition 1

The rank of the concatenated matrix of hidden states across all generated tokens can be heuristically measured by the nuclear norm of that matrix (fazel2002matrix).
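
A small numerical illustration of the proposition, as a sketch under stated assumptions: the hidden-state matrices below are synthetic, with 8 "tokens" of dimension 16, and the helper `nuclear_norm` is a hypothetical name. Both matrices have the same Frobenius norm, so the difference in nuclear norm reflects how many independent directions the rows span.

```python
import numpy as np

def nuclear_norm(h):
    """Sum of singular values of a (num_tokens, hidden_dim) matrix."""
    return np.linalg.svd(h, compute_uv=False).sum()

rng = np.random.default_rng(0)

# Repetitive sequence: every token has the same unit-norm hidden state (rank 1).
base = rng.standard_normal(16)
base /= np.linalg.norm(base)
repetitive = np.tile(base, (8, 1))

# Diverse sequence: 8 mutually orthogonal unit-norm hidden states (rank 8).
q, _ = np.linalg.qr(rng.standard_normal((16, 8)))
diverse = q.T

for name, h in [("repetitive", repetitive), ("diverse", diverse)]:
    print(name, "rank:", np.linalg.matrix_rank(h),
          "nuclear norm:", round(float(nuclear_norm(h)), 3))
```

The diverse matrix has the larger nuclear norm, which is why maximizing the nuclear norm serves as a differentiable stand-in for maximizing rank, and hence token diversity.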

Figures (12)

  • Figure 1: Energy consumption and latency time each increase approximately linearly with the length of generated sequences in VLMs. Following shumailov2021sponge, energy consumption is estimated with the NVIDIA Management Library (NVML), and latency time is the response time of an inference.
  • Figure 2: An overview of verbose images against VLMs to increase the length of generated sequences, thereby inducing higher energy-latency cost. Three losses are designed to craft verbose images by delaying EOS occurrence, enhancing output uncertainty, and improving token diversity. Besides, a temporal weight adjustment algorithm is proposed to better utilize the three objectives.
  • Figure 3: The length distribution of four VLMs: (a) BLIP, (b) BLIP-2, (c) InstructBLIP, (d) MiniGPT-4. The peak of the length distribution for our verbose images shifts toward longer sequences.
  • Figure 4: GradCAM for the original image $\bm{x}$ and our verbose counterpart $\bm{x}'$. The attention of our verbose images is more dispersed and uniform. Only part of the generated content is shown.
  • Figure 5: The length distribution of four VLMs on the MS-COCO dataset: (a) BLIP, (b) BLIP-2, (c) InstructBLIP, (d) MiniGPT-4. The peak of the length distribution for our verbose images shifts toward longer sequences.
  • ...and 7 more figures

Theorems & Definitions (2)

  • Definition 1
  • Proposition 1