ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Ziyi Gao, Kai Chen, Zhipeng Wei, Tingshu Mou, Jingjing Chen, Zhiyu Tan, Hao Li, Yu-Gang Jiang

TL;DR

ReToMe-VA tackles the problem of creating imperceptible yet transferable adversarial video clips using diffusion models. It combines Timestep-wise Adversarial Latent Optimization (TALO), which updates the latents at each denoising step to preserve spatial structure, with Recursive Token Merging (ReToMe), which merges tokens across frames in the self-attention module to enforce temporal coherence and to let gradients propagate between frames. The framework improves adversarial transferability over prior image-based and diffusion-based methods by more than 14.16% on average, while maintaining frame quality and temporal smoothness and remaining robust against several defenses. The work offers a practical approach to evaluating and improving the robustness of video recognition systems.

Abstract

Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with higher transferability than previous unrestricted attacks and restricted attacks. However, existing work on diffusion-based unrestricted attacks focuses mostly on images and has seldom been explored for videos. In this paper, we propose Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in the diffusion model's latent space at each denoising step. TALO offers iterative and accurate updates that generate more powerful adversarial frames, and it further reduces memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism that matches and merges tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe also introduces inter-frame interactions into the attack process, inducing more diverse and robust gradients and thus leading to better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, which surpasses state-of-the-art attacks in adversarial transferability by more than 14.16% on average.
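
The matching-and-merging of tokens across frames described in the abstract can be illustrated with a minimal sketch. It shows a single merge/unmerge step between two frames only; the recursive schedule used by ReToMe is not reproduced. The function names, the cosine-similarity matching, and the averaging rule are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two token vectors (0.0 if either is all-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(src, dst, r):
    """Merge the r source-frame tokens that best match a destination-frame token.

    Hypothetical helper: returns (kept_src, merged_dst, assignment), where
    assignment maps a merged source index to the destination index it joined.
    """
    matches = []
    for i, s in enumerate(src):
        # Best-matching destination token for this source token.
        j, sim = max(((j, cosine(s, d)) for j, d in enumerate(dst)),
                     key=lambda pair: pair[1])
        matches.append((sim, i, j))
    matches.sort(reverse=True)  # most similar pairs first
    assignment = {i: j for _, i, j in matches[:r]}

    # Average each destination token with the source tokens merged into it.
    sums = [list(d) for d in dst]
    counts = [1] * len(dst)
    for i, j in assignment.items():
        sums[j] = [a + b for a, b in zip(sums[j], src[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [(i, list(s)) for i, s in enumerate(src) if i not in assignment]
    return kept_src, merged_dst, assignment


def unmerge_tokens(kept_src, merged_dst, assignment, n_src):
    """Restore the source frame's token count by copying merged values back."""
    restored = [None] * n_src
    for i, tok in kept_src:
        restored[i] = tok
    for i, j in assignment.items():
        restored[i] = list(merged_dst[j])
    return restored


# Two toy frames with two 2-D tokens each (made-up values).
frame_a = [[1.0, 0.0], [0.0, 1.0]]
frame_b = [[3.0, 0.0], [-1.0, -1.0]]
kept, merged, assignment = merge_tokens(frame_a, frame_b, r=1)
restored_a = unmerge_tokens(kept, merged, assignment, len(frame_a))
```

In this sketch, a merged token is shared by both frames, so any attention output or gradient computed on it is copied back to both frames on unmerging. That sharing is one plausible reading of how merging can couple frames during the attack.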

Paper Structure

This paper contains 16 sections, 7 equations, 6 figures, 7 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Difference between restricted attacks, unrestricted attacks, and diffusion-based unrestricted attacks.
  • Figure 2: Framework overview of the proposed ReToMe-VA. For a video clip, DDIM inversion is applied to map the benign frames into the latent space. Timestep-wise Adversarial Latent Optimization is employed during the DDIM sampling process to optimize the latents. Throughout the whole pipeline, Recursive Token Merging and Recursive Token Unmerging Modules are integrated into the diffusion model to enhance its effectiveness. Additionally, structure loss is utilized to maintain the structural consistency of video frames. Ultimately, the resulting adversarial video clip is capable of deceiving the target model.
  • Figure 3: Recursive token merging process.
  • Figure 4: Qualitative results of frame quality. (a) Visual quality comparisons among different attack methods. (b) More adversarial frames generated by ReToMe-VA. The left is the benign frame and the right is the adversarial frame.
  • Figure 5: A sample video generated by our method.
  • ...and 1 more figure
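
The Figure 2 caption describes latents being optimized during the DDIM sampling process rather than once before it. A toy sketch of that per-timestep loop is given here. It is not the paper's algorithm: `denoise_step` is a hypothetical stand-in for a DDIM step, the adversarial loss is a made-up linear score, the signed-gradient update is an assumption, and the structure loss and token merging are omitted.

```python
def denoise_step(latent, t):
    """Hypothetical stand-in for one DDIM sampling step (simple shrinkage)."""
    return [0.9 * v for v in latent]


def attack_gradient(latent, weights):
    """Gradient of the toy loss L(z) = -sum(w_i * z_i) with respect to z.

    The toy 'classifier score' is sum(w_i * z_i); raising L lowers that score.
    """
    return [-w for w in weights]


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def timestep_wise_optimize(latent, weights, num_steps=5, step_size=0.05):
    """Update the latent at every sampling step, then denoise it one step.

    Each update uses a gradient computed at the current step only, so no
    computation graph spanning all sampling steps has to be kept in memory.
    """
    z = list(latent)
    for t in range(num_steps, 0, -1):
        grad = attack_gradient(z, weights)
        # Signed gradient ascent on the toy adversarial loss (assumed rule).
        z = [v + step_size * sign(g) for v, g in zip(z, grad)]
        z = denoise_step(z, t)
    return z


def score(latent, weights):
    """Toy classifier score for the true class."""
    return sum(w * v for w, v in zip(weights, latent))


start = [1.0, -0.5]      # made-up inverted latent
weights = [2.0, -1.0]    # made-up classifier weights
clean = timestep_wise_optimize(start, weights, step_size=0.0)
adversarial = timestep_wise_optimize(start, weights, step_size=0.05)
```

With `step_size=0.0` the loop reduces to plain sampling, which gives a clean baseline to compare the attacked latent against.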