SVASTIN: Sparse Video Adversarial Attack via Spatio-Temporal Invertible Neural Networks

Yi Pan; Jun-Jie Huang; Zihan Chen; Wentao Zhao; Ziyue Wang

TL;DR

This work tackles the challenge of generating imperceptible, targeted adversarial videos by exploiting spatio-temporal information exchange. It introduces SVASTIN, an architecture that combines a Spatio-Temporal Invertible Neural Network (STIN) with a Guided Target Video Learning (GTVL) module. STIN uses a 3D-DWT-based decomposition and Spatio-Temporal Affine Coupling Blocks to transfer discriminative content from a target class into the source video while preserving perceptual quality. The method optimizes an adversarial loss and a guidance loss to produce a target feature tensor and an adversarial video that misleads action classifiers with high fooling rates and low perceptual distortion, as demonstrated on UCF-101 and Kinetics-400 across multiple models. Overall, SVASTIN advances the design of sparse, imperceptible, targeted video attacks and provides a practical framework for evaluating the robustness of video-based DNNs. The code is publicly available, enabling replication and further exploration of spatio-temporal invertible approaches.
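To make the 3D-DWT-based decomposition mentioned above more concrete, the following is a minimal NumPy sketch of a one-level 3D discrete wavelet transform and its exact inverse. The Haar wavelet, the single-channel `(T, H, W)` layout, and the function names `haar_split`, `haar_merge`, `dwt3d`, and `idwt3d` are assumptions made for illustration; the summary does not state which wavelet or tensor layout SVASTIN uses.

```python
import numpy as np

def haar_split(x, axis):
    """One-level orthonormal Haar split along `axis` (its length must be even)."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_merge(low, high, axis):
    """Exact inverse of haar_split: re-interleave even and odd samples."""
    even = (low + high) / np.sqrt(2.0)
    odd = (low - high) / np.sqrt(2.0)
    stacked = np.stack([even, odd], axis=axis + 1)
    shape = list(low.shape)
    shape[axis] *= 2
    return stacked.reshape(shape)

def dwt3d(video):
    """Decompose a (T, H, W) array into 8 sub-bands keyed 'LLL' ... 'HHH'.

    The three letters refer to the time, height, and width axes in that order.
    """
    bands = {"": video}
    for axis in range(3):
        nxt = {}
        for key, band in bands.items():
            low, high = haar_split(band, axis)
            nxt[key + "L"] = low
            nxt[key + "H"] = high
        bands = nxt
    return bands

def idwt3d(bands):
    """Reconstruct the (T, H, W) array from the 8 sub-bands of dwt3d."""
    for axis in (2, 1, 0):
        prefixes = {key[:-1] for key in bands}
        bands = {p: haar_merge(bands[p + "L"], bands[p + "H"], axis) for p in prefixes}
    return bands[""]

# Toy "video": 4 frames of 8x8 pixels (hypothetical data, for illustration only).
video = np.random.default_rng(0).standard_normal((4, 8, 8))
bands = dwt3d(video)
reconstructed = idwt3d(bands)
```

Each sub-band has half the resolution along every axis, and because the transform is orthonormal it loses no information: `reconstructed` matches `video` up to floating-point error. That lossless property is what makes a wavelet front end compatible with an invertible network.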

Abstract

Robust and imperceptible adversarial video attacks are challenging because of the spatial and temporal characteristics of videos. Existing video adversarial attack methods mainly take a gradient-based approach and generate adversarial videos with noticeable perturbations. In this paper, we propose a novel Sparse Adversarial Video Attack via Spatio-Temporal Invertible Neural Networks (SVASTIN), which generates adversarial videos by exchanging information in a spatio-temporal feature space. It consists of a Guided Target Video Learning (GTVL) module, which balances the perturbation budget and optimization speed, and a Spatio-Temporal Invertible Neural Network (STIN) module, which exchanges spatio-temporal feature-space information between a source video and the target feature tensor learned by the GTVL module. Extensive experiments on UCF-101 and Kinetics-400 demonstrate that SVASTIN generates adversarial examples with higher imperceptibility and a higher fooling rate than state-of-the-art methods. Code is available at https://github.com/Brittany-Chen/SVASTIN.
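The information exchange performed by an invertible network can be illustrated with a generic affine coupling block. The sketch below is not the authors' Spatio-Temporal Affine Coupling Block: it uses a standard additive/affine coupling form, and the sub-networks `phi`, `rho`, and `eta` are hypothetical stand-ins (fixed random linear maps followed by `tanh`) for whatever learned spatio-temporal layers the paper uses. It only shows why such a block can mix a source branch and a target branch without losing information.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 6  # feature size of each branch (arbitrary, for illustration)

# Hypothetical stand-ins for learned sub-networks: small fixed linear maps.
W_phi, W_rho, W_eta = (0.1 * rng.standard_normal((DIM, DIM)) for _ in range(3))

def phi(z):
    return np.tanh(z @ W_phi)

def rho(z):
    return np.tanh(z @ W_rho)

def eta(z):
    return np.tanh(z @ W_eta)

def coupling_forward(src, tgt):
    """Mix the two branches: additive update on src, affine update on tgt."""
    src_out = src + phi(tgt)
    tgt_out = tgt * np.exp(rho(src_out)) + eta(src_out)
    return src_out, tgt_out

def coupling_inverse(src_out, tgt_out):
    """Undo coupling_forward exactly, in the reverse order of the updates."""
    tgt = (tgt_out - eta(src_out)) * np.exp(-rho(src_out))
    src = src_out - phi(tgt)
    return src, tgt

# Toy source and target features (hypothetical data).
src = rng.standard_normal((3, DIM))
tgt = rng.standard_normal((3, DIM))

src_out, tgt_out = coupling_forward(src, tgt)
src_rec, tgt_rec = coupling_inverse(src_out, tgt_out)
```

The inverse never has to invert `phi`, `rho`, or `eta` themselves, so they can be arbitrary networks. Recovering `src_rec` and `tgt_rec` only requires subtracting and dividing by quantities that are recomputed from the block's outputs.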

Paper Structure (10 sections, 3 equations, 2 figures, 4 tables)

Introduction
Proposed Method
Overview
Spatio-Temporal Invertible Neural Network Module
Guided Target Video Learning Module
Experiments
Experimental Setup
Evaluation on Targeted Attacks
Ablation Study
Conclusion

Figures (2)

Figure 1: Overview of Sparse Adversarial Video Attack via Spatio-Temporal Invertible Neural Networks (SVASTIN). The Spatio-Temporal Invertible Neural Network (STIN) module uses its information-preservation property to non-linearly exchange information between the input benign video and the target video. The Guided Target Video Learning (GTVL) module updates the learnable target video $\bm{X}_{t}$.
Figure 2: Visualization of the adversarial videos and residual images generated by different methods on the UCF-101 dataset against the MVIT model. The clean video is correctly classified by the target classifier as Harming, and the target class is Shotput. For each method, the upper row shows the generated adversarial video frames and the lower row shows the residual images, amplified 20 times for visibility.
