From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge

Hui Lu, Yi Yu, Song Xia, Yiming Yang, Deepu Rajan, Boon Poh Ng, Alex Kot, Xudong Jiang

TL;DR

This work reveals a practical security vulnerability in video foundation models by showing that adversaries can attack downstream tasks using only an open-source video backbone, without task data or access to the victim model. The authors propose Transferable Video Attack (TVA), a task-agnostic framework that combines embedding-space perturbations, a bidirectional temporal-aware contrastive loss, and a temporal consistency loss to maximize cross-model transferability. TVA achieves superior transfer performance across 24 video tasks, substantially degrading downstream models and MLLMs, and demonstrates robustness to defenses and efficiency advantages. The findings highlight the need for security-aware deployment of video foundation backbones and motivate future defense research in the video domain.

Abstract

Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
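
To make the idea of a bidirectional, frame-level contrast between clean and adversarial features more concrete, the following is a minimal sketch in plain Python. It is not the authors' formulation, which this summary does not reproduce: it assumes a symmetric InfoNCE-style loss over per-frame features with a temperature, and the names `bidirectional_contrastive_loss`, `_info_nce`, and `_normalize` are hypothetical. In this reading, an attacker would push the loss upward so that adversarial frame features stop matching their clean counterparts.

```python
import math


def _normalize(vec):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _info_nce(anchors, candidates, temperature):
    """One-way contrast: frame t of `anchors` should match frame t of `candidates`.

    Returns the mean cross-entropy over frames, using cosine similarity
    divided by `temperature` as the logit.
    """
    total = 0.0
    for t, anchor in enumerate(anchors):
        logits = [
            sum(a * c for a, c in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_z - logits[t]
    return total / len(anchors)


def bidirectional_contrastive_loss(clean, adv, temperature=0.1):
    """Assumed symmetric loss: average of clean-to-adv and adv-to-clean terms.

    `clean` and `adv` are lists of per-frame feature vectors of equal length.
    A low value means adversarial frames still align with their clean frames;
    an attack in this sketch would try to maximize the value.
    """
    clean = [_normalize(f) for f in clean]
    adv = [_normalize(f) for f in adv]
    forward = _info_nce(clean, adv, temperature)   # clean -> adversarial
    backward = _info_nce(adv, clean, temperature)  # adversarial -> clean
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    clean_feats = [[1.0, 0.0], [0.0, 1.0]]
    # Identical features: frames align, so the loss is close to zero.
    print(bidirectional_contrastive_loss(clean_feats, clean_feats))
    # Swapped frame features: every frame matches the wrong partner.
    print(bidirectional_contrastive_loss(clean_feats, clean_feats[::-1]))
```

Averaging the two directions gives both the clean and the adversarial features a symmetric role in the objective, which is the property a one-way clean-to-adversarial loss lacks; the temperature controls how sharply mismatched frames are penalized.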

Paper Structure

This paper contains 32 sections, 2 theorems, 35 equations, 4 figures, 10 tables.

Key Result

Theorem 1

Let $f_{\phi_{\tau}}$ be the victim model finetuned on task $\tau$. The deviation in perturbation updates between the surrogate and downstream models for Form (a) can be expressed as: and, under the Form (b) head-attached case:

Figures (4)

  • Figure 1: Overview of TVA: TVA deceives various downstream models or MLLMs using only the open-source “Video backbone” or “Video encoder”. “FC” (Frisbee Catch) indicates a misclassification. “AdaTAD” denotes the SOTA model.
  • Figure 2: Overview of the Bi-con loss and TC loss: (a) applied to the temporal level, and (b) implemented at the frame level.
  • Figure 3: Performance comparison of different contrastive learning strategies on four TAD models. Video-level uses standard video-level contrast, clean2adv applies one-way clean-to-adversarial loss (Eq. \ref{eq:con}), while ours adopts frame-level bidirectional contrast. Our method has the lowest mAP.
  • Figure 4: The influence of temperature in bidirectional contrastive loss.

Theorems & Definitions (8)

  • Definition 1: Transferable Adversarial Attack via Open-Sourced Video Foundation Model
  • Definition 2: Self-supervised Adversarial Perturbation
  • Theorem 1: Deviation in updating adversarial perturbation
  • Remark 1
  • Theorem 2: Gradient Asymmetry in Single-direction Contrastive Loss
  • Remark 2
  • proof
  • proof