
VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang, Xiao Yang, Jianyu Wang, Siqi Cai

TL;DR

VidDoS is the first universal Energy-Latency Attack (ELA) framework tailored for Video-LLMs. It uses universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation.

Abstract

Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through $\textit{masked teacher forcing}$ to steer models toward expensive target sequences, combined with a $\textit{refusal penalty}$ and $\textit{early-termination suppression}$ to override conciseness priors. Experiments across three mainstream Video-LLMs and three video datasets, covering video question answering and autonomous driving scenarios, show extreme degradation. VidDoS induces a token expansion of more than 205$\times$ and inflates the inference latency by more than 15$\times$ relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
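To make the streaming claim concrete, the following is a minimal sketch of how a fixed per-request slowdown compounds when requests keep arriving. It is not the paper's simulation: the single-worker FIFO queue, the 1 s arrival period, and the 0.5 s clean latency are all assumed values chosen for illustration; only the 15$\times$ factor comes from the abstract.

```python
def response_delays(n_requests, period_s, service_s):
    """Arrival-to-completion delay for each request in a FIFO, single-worker queue.

    Hypothetical model: requests arrive every `period_s` seconds and each
    takes `service_s` seconds of inference time.
    """
    delays = []
    free_at = 0.0  # time at which the worker next becomes idle
    for k in range(n_requests):
        arrival = k * period_s
        start = max(arrival, free_at)  # wait if the worker is still busy
        free_at = start + service_s
        delays.append(free_at - arrival)
    return delays


# Assumed clean latency of 0.5 s, inflated 15x under attack.
clean = response_delays(5, period_s=1.0, service_s=0.5)
attacked = response_delays(5, period_s=1.0, service_s=0.5 * 15)
print(clean)     # [0.5, 0.5, 0.5, 0.5, 0.5]
print(attacked)  # [7.5, 14.0, 20.5, 27.0, 33.5]
```

Under these assumptions the clean system keeps up with the stream, while the attacked system falls further behind with every request, because each inference takes longer than the interval between arrivals.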

Paper Structure (34 sections, 9 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Overview of VidDoS. The framework involves: (1) Defining the attack goal; (2) Temporal injection of a universal patch; (3) Steering the decoder trajectory via joint losses (Masked Teacher Forcing, Refusal Penalty, and Early-Termination Suppression (ETS)); (4) Offline patch optimization via Sign-PGD; and (5) Real-time latency induction.
  • Figure 2: Cumulative latency under VidDoS attack in a video streaming scenario.
  • Figure 3: (a) Cross-dataset transfer. Each entry reports the average output length when a model trained on the source dataset (row) is evaluated on the target dataset (column). (b) Length distribution comparison. For each dataset, output length distributions of the original model and under VidDoS are compared.
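
Step (4) of Figure 1 names Sign-PGD as the offline optimizer. The sketch below shows only the generic update rule for a single perturbation shared across many inputs: take the sign of the averaged gradient, step, and project back onto an $L_\infty$ ball. The quadratic toy objective, the function names, and the values of `eps`, `alpha`, and `steps` are assumptions for illustration; the paper's joint losses and model are not reproduced here.

```python
import numpy as np


def sign_pgd_universal(xs, target, eps=0.1, alpha=0.01, steps=50):
    """Optimize one shared perturbation `delta` for all rows of `xs`.

    Toy objective (an assumption, not the paper's loss):
    0.5 * mean_i ||x_i + delta - target||^2, minimized over delta.
    """
    delta = np.zeros(xs.shape[1])
    for _ in range(steps):
        # Gradient of the toy objective w.r.t. delta, averaged over inputs.
        grad = np.mean(xs + delta - target, axis=0)
        # Signed descent step, then projection onto the L-infinity ball.
        delta = np.clip(delta - alpha * np.sign(grad), -eps, eps)
    return delta


def toy_loss(xs, delta, target):
    """Value of the toy objective for a given shared perturbation."""
    return 0.5 * np.mean(np.sum((xs + delta - target) ** 2, axis=1))


rng = np.random.default_rng(0)
xs = rng.normal(size=(8, 4))   # eight toy "inputs" with four features each
target = np.full(4, 3.0)       # arbitrary toy target
delta = sign_pgd_universal(xs, target)
```

Because `delta` is optimized against the average gradient over all inputs rather than per input, the same perturbation can be applied to unseen inputs without further gradient computation, which is the sense in which such a trigger is instance-agnostic.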