An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo; Zenghui Yuan; Wenqi Wei; Daizong Liu; Pan Zhou

TL;DR

This work addresses the risk of excessive token generation in Vision-Language Models (VLMs) under token-based pricing by introducing the Verbose-Text Induction Attack (VTIA). VTIA splits the attack into two stages: first, a reinforcement-learning-based Adversarial Prompt Search obtains malicious prompt embeddings; second, Vision-Aligned Perturbation Optimization aligns the perturbed image embeddings with those adversarial prompt embeddings, so that output length is maximized explicitly rather than as a side effect. The approach uses a formal token-length objective and two complementary losses, $\mathcal{L}_{sim}$ and $\mathcal{L}_{std}$, to drive stable, high-verbosity responses across four VLMs (Blip2, InstructBlip, LLaVA, Qwen2-VL) on MS-COCO, achieving up to approximately 122× longer outputs and a near-100% extra-long rate. These results reveal a security vulnerability in token-based cost systems for VLMs and motivate the development of robust defenses and cost-aware mitigations for multimodal inference.

Abstract

With the remarkable success of Vision-Language Models (VLMs) on multimodal tasks, concerns regarding their deployment efficiency have become increasingly prominent. In particular, the number of tokens consumed during the generation process has emerged as a key evaluation metric. Prior studies have shown that specific inputs can induce VLMs to generate lengthy outputs with low information density, which significantly increases energy consumption, latency, and token costs. However, existing methods simply delay the occurrence of the EOS token to implicitly prolong the output and do not directly maximize the output token length as an explicit optimization objective, so they lack stability and controllability. To address these limitations, this paper proposes a novel verbose-text induction attack (VTIA) that injects imperceptible adversarial perturbations into benign images via a two-stage framework, which identifies the most malicious prompt embeddings and uses them to optimize the perturbed images so as to maximize the output token length. Specifically, we first perform adversarial prompt search, employing reinforcement learning strategies to automatically identify adversarial prompts capable of inducing the LLM component within VLMs to produce verbose outputs. We then conduct vision-aligned perturbation optimization to craft adversarial examples on input images, maximizing the similarity between the perturbed image's visual embeddings and the embeddings of the adversarial prompt, thereby constructing malicious images that trigger verbose text generation. Comprehensive experiments on four popular VLMs demonstrate that our method achieves significant advantages in terms of effectiveness, efficiency, and generalization capability.
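
The second stage described above, aligning a perturbed image's embeddings with a fixed adversarial prompt embedding, can be pictured with a small toy sketch. Everything in it is an assumption made for illustration: a random linear map stands in for the visual encoder, a random vector stands in for the adversarial prompt embedding (which the paper obtains through reinforcement learning), and signed-gradient ascent inside an L-infinity budget is one common way to keep a perturbation small, not necessarily the optimizer the authors use. The function names and the values of `eps`, `step`, and `iters` are hypothetical, and the paper's $\mathcal{L}_{sim}$ and $\mathcal{L}_{std}$ losses are not reproduced here; only a plain cosine-similarity objective is shown.

```python
import math
import random


def encode(W, x):
    """Toy linear stand-in for a visual encoder: z = W x (W is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def cosine(a, b):
    return sum(p * q for p, q in zip(a, b)) / (norm(a) * norm(b))


def similarity_grad(W, x, target):
    """Gradient of cosine(W x, target) with respect to x (analytic, linear encoder)."""
    z = encode(W, x)
    nz, nt = norm(z), norm(target)
    dot = sum(p * q for p, q in zip(z, target))
    # d cos / d z = t / (|z||t|) - (z.t) z / (|z|^3 |t|)
    gz = [t / (nz * nt) - dot * c / (nz ** 3 * nt) for c, t in zip(z, target)]
    # Chain rule through z = W x gives W^T gz.
    return [sum(W[i][j] * gz[i] for i in range(len(W))) for j in range(len(x))]


def sign(g):
    return (g > 0) - (g < 0)


def align_perturbation(W, image, target, eps=0.03, step=0.005, iters=50):
    """Signed-gradient ascent on cosine similarity inside an L-infinity ball.

    Returns the best perturbation seen, so the result is never worse than
    leaving the image untouched.
    """
    delta = [0.0] * len(image)
    best_delta, best_sim = list(delta), cosine(encode(W, image), target)
    for _ in range(iters):
        adv = [p + d for p, d in zip(image, delta)]
        grad = similarity_grad(W, adv, target)
        stepped = [d + step * sign(g) for d, g in zip(delta, grad)]
        clipped = [max(-eps, min(eps, d)) for d in stepped]  # perturbation budget
        adv = [max(0.0, min(1.0, p + d)) for p, d in zip(image, clipped)]  # valid pixels
        delta = [a - p for a, p in zip(adv, image)]
        sim = cosine(encode(W, adv), target)
        if sim > best_sim:
            best_delta, best_sim = list(delta), sim
    return best_delta


# Hypothetical toy data: 8 "pixels", 4-dimensional embedding space.
random.seed(0)
EPS = 0.03
W = [[random.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(4)]
image = [random.uniform(0.2, 0.8) for _ in range(8)]
target = [random.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in prompt embedding

delta = align_perturbation(W, image, target, eps=EPS)
before = cosine(encode(W, image), target)
after = cosine(encode(W, [p + d for p, d in zip(image, delta)]), target)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In a real VLM the encoder is nonlinear and the gradient would come from automatic differentiation rather than a closed form, but the structure is the same: take a small step that raises the similarity, project back into the perturbation budget, and repeat.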
