VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang; Dasen Dai; Jiyao Wang; Xiao Yang; Jianyu Wang; Siqi Cai

TL;DR

VidDoS is the first universal Energy-Latency Attack (ELA) framework tailored for Video-LLMs. It uses universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation.

Abstract

Video-LLMs are increasingly deployed in safety-critical applications but are vulnerable to Energy-Latency Attacks (ELAs) that exhaust computational resources. Current image-centric methods fail because temporal aggregation mechanisms dilute individual frame perturbations. Additionally, real-time demands make instance-wise optimization impractical for continuous video streams. We introduce VidDoS, the first universal ELA framework tailored for Video-LLMs. Our method leverages universal optimization to create instance-agnostic triggers that require no inference-time gradient calculation. We achieve this through $\textit{masked teacher forcing}$ to steer models toward expensive target sequences, combined with a $\textit{refusal penalty}$ and $\textit{early-termination suppression}$ to override conciseness priors. Experiments on three mainstream Video-LLMs and three video datasets, spanning video question answering and autonomous driving scenarios, show extreme degradation. VidDoS induces a token expansion of more than 205$\times$ and inflates inference latency by more than 15$\times$ relative to clean baselines. Simulations of real-time autonomous driving streams further reveal that this induced latency leads to critical safety violations. We urge the community to recognize and mitigate these high-hazard ELAs in Video-LLMs.
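To make the two headline metrics concrete, the following minimal sketch shows one way a token expansion factor and a latency inflation factor could be computed from paired clean and attacked measurements. It assumes each factor is a ratio of means; the abstract does not say whether the paper uses that or another aggregate. The function name `expansion_ratio` and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean


def expansion_ratio(clean, attacked):
    """Ratio of the mean attacked value to the mean clean value.

    Assumption: a ratio of means; the paper may aggregate differently.
    """
    clean_mean = mean(clean)
    if clean_mean <= 0:
        raise ValueError("clean baseline mean must be positive")
    return mean(attacked) / clean_mean


# Hypothetical measurements for illustration only, NOT results from the paper.
clean_tokens = [4, 6, 5, 5]             # output lengths on clean videos
attacked_tokens = [500, 500, 480, 520]  # output lengths with a trigger applied
clean_latency_s = [0.5, 0.5, 0.5, 0.5]
attacked_latency_s = [5.0, 5.0, 4.0, 6.0]

token_expansion = expansion_ratio(clean_tokens, attacked_tokens)
latency_inflation = expansion_ratio(clean_latency_s, attacked_latency_s)
print(f"token expansion: {token_expansion:.1f}x")
print(f"latency inflation: {latency_inflation:.1f}x")
```

Latency usually grows more slowly than token count because part of each request's cost (video encoding and prompt prefill) is paid once regardless of output length, which is consistent with the abstract reporting a smaller factor for latency than for tokens.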

Paper Structure (34 sections, 9 equations, 3 figures, 3 tables)

This paper contains 34 sections, 9 equations, 3 figures, 3 tables.

Introduction
Related Work
Video Understanding Models
Energy-Latency Attacks
Methodology
Preliminaries and Threat Model
Problem Definition.
Threat Model.
Design of the Proposed Attack
Motivation.
Masked Teacher Forcing.
Refusal Penalty and Early-Termination Suppression.
Universal Optimization via Patch Training.
Experiment
Setup
...and 19 more sections

Figures (3)

Figure 1: Overview of VidDoS. The framework involves: (1) Defining the attack goal; (2) Temporal injection of a universal patch; (3) Steering decoder trajectory via joint losses (Masked Teacher Forcing, Refusal Penalty, and ETS); (4) Offline patch optimization via Sign-PGD; and (5) Real-time latency induction.
Figure 2: Cumulative latency under VidDoS attack in video streaming scenario.
Figure 3: (a) Cross-dataset transfer. Each entry reports the average output length when a model trained on the source dataset (row) is evaluated on the target dataset (column). (b) Length distribution comparison. For each dataset, output length distributions of the original model and under VidDoS are compared.
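
The streaming scenario in Figure 2 can be illustrated with a simple queueing sketch: when each clip takes longer to process than the interval between arriving clips, the backlog grows and the delay accumulates. The model below is an assumption for illustration (a single worker processing clips in arrival order), not the paper's simulation setup; `simulate_stream`, the arrival interval, the deadline, and the service times are all hypothetical.

```python
def simulate_stream(service_times, arrival_interval, deadline):
    """Single-worker, first-in-first-out model of a video stream.

    Clip i arrives at i * arrival_interval and is processed only after all
    earlier clips finish. Returns the per-clip lag (finish time minus arrival
    time) and the number of clips whose lag exceeds the deadline.
    """
    finish = 0.0
    lags = []
    for i, service in enumerate(service_times):
        arrival = i * arrival_interval
        start = max(arrival, finish)  # wait for the previous clip if needed
        finish = start + service
        lags.append(finish - arrival)
    violations = sum(lag > deadline for lag in lags)
    return lags, violations


# Hypothetical timings in seconds, NOT measurements from the paper.
clean_lags, clean_violations = simulate_stream([0.5] * 5, 1.0, 1.0)
attacked_lags, attacked_violations = simulate_stream([4.0] * 5, 1.0, 1.0)
print("clean lags:", clean_lags, "violations:", clean_violations)
print("attacked lags:", attacked_lags, "violations:", attacked_violations)
```

Under these assumed timings the clean stream keeps a constant lag, while the slowed stream falls further behind with every clip, so even a fixed per-request slowdown translates into unbounded response delay for later frames.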
