EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models
Minjae Seo, Myoungsung You, Junhee Lee, Jaehan Kim, Hwanjo Heo, Jintae Oh, Jinwoo Kim
TL;DR
The paper addresses the vulnerability of vision models to resource-exhausting attacks by proposing EO-VLM, a method that uses vision-language model (VLM) prompts to craft adversarial images that drive up GPU energy usage. The approach is model-agnostic: it relies on the VLM to extract compromising factors and embed them into inputs via prompts, and it measures energy cost as $E = W \cdot t$, the product of power draw $W$ and elapsed time $t$. Experiments on YOLOv8, MaskDINO, and Detectron2 on an RTX 4090 show energy increases of up to about 50% from a single image, a practical security risk for real-time vision systems. The work stresses the need to address unfiltered VLMs in security-critical pipelines and identifies reinforcement learning as a possible way to optimize adversarial prompts in future work.
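To make the energy metric concrete, the following is a minimal sketch of how $E = W \cdot t$ and a relative energy increase could be computed. The function names and all numeric values are hypothetical and chosen only for illustration; they are not measurements or code from the paper, and the paper's actual measurement procedure may differ.

```python
def energy_joules(avg_power_watts: float, duration_s: float) -> float:
    """Return E = W * t in joules, given average power (W) and time (s)."""
    if avg_power_watts < 0 or duration_s < 0:
        raise ValueError("power and duration must be non-negative")
    return avg_power_watts * duration_s


def relative_increase(baseline_j: float, attacked_j: float) -> float:
    """Return the fractional energy increase of attacked over baseline."""
    if baseline_j <= 0:
        raise ValueError("baseline energy must be positive")
    return (attacked_j - baseline_j) / baseline_j


# Hypothetical values, not results from the paper.
baseline = energy_joules(avg_power_watts=200.0, duration_s=0.5)    # 100.0 J
attacked = energy_joules(avg_power_watts=240.0, duration_s=0.625)  # 150.0 J
print(f"{relative_increase(baseline, attacked):.0%}")  # prints "50%"
```

In this illustration the increase comes from both higher power draw and longer processing time; either factor alone would also raise $E$.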
Abstract
Vision models are increasingly deployed in critical applications such as autonomous driving and CCTV monitoring, yet they remain susceptible to resource-consuming attacks. In this paper, we introduce a novel energy-overloading attack that leverages vision-language model (VLM) prompts to generate adversarial images targeting vision models. These images, though imperceptible to the human eye, significantly increase GPU energy consumption across various vision models, threatening the availability of these systems. Our framework, EO-VLM (Energy Overload via VLM), is model-agnostic, meaning it is not limited by the architecture or type of the target vision model. By exploiting the lack of safety filters in VLMs like DALL-E 3, we create adversarial noise images without requiring prior knowledge of the target vision models or their internal structure. Our experiments demonstrate up to a 50% increase in energy consumption, revealing a critical vulnerability in current vision models.
