Temporal Consistency Constrained Transferable Adversarial Attacks with Background Mixup for Action Recognition

Ping Li, Jianan Ni, Bo Pang

TL;DR

This paper tackles the problem of transferring adversarial examples to unseen action recognition models in black-box settings. It introduces BMTC, a two-module framework comprising Background Adversarial Mixup (BAM) and Background-induced Temporal Gradient enhancement (BTG), to reduce dependence on boundary similarity and stabilize attack directions across time. The approach leverages reinforcement learning to select background frames and optimizes a combined loss $\mathcal{L}_{total}=\mathcal{L}_{back}+\beta\mathcal{L}_{tgc}$, enabling progressive, temporally coherent attacks via PGD updates. Experiments on UCF101, Kinetics-400, and ImageNet show substantial transferability gains across CNN and ViT architectures, with efficiency advantages over prior methods, and qualitative visualizations confirm perturbations remain largely imperceptible. The work advances practical black-box attacks on action recognition and provides a reusable codebase for further research and defense evaluation.
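To make the combined objective and the PGD update concrete, the following is a minimal sketch rather than the authors' implementation. It assumes an $L_\infty$ budget, pixel values in $[0, 1]$, a clip flattened to a list of floats, and gradient ascent on the objective; the helper names (`total_loss`, `pgd_step`) and the parameters `alpha` and `eps` are hypothetical, and the gradient of $\mathcal{L}_{total}$ is taken as given instead of being computed from a surrogate model.

```python
def total_loss(l_back, l_tgc, beta):
    """Combined objective L_total = L_back + beta * L_tgc."""
    return l_back + beta * l_tgc


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_step(x_adv, x_clean, grad, alpha, eps):
    """One signed-gradient PGD step under an L_inf budget (assumed setup).

    x_adv, x_clean, grad: equal-length lists of floats (a flattened clip).
    alpha: step size; eps: maximum per-pixel perturbation (both hypothetical).
    """
    updated = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        # Move along the sign of the gradient (ascent is an assumption here).
        stepped = xa + alpha * sign(g)
        # Project back into the eps-ball around the clean pixel.
        projected = min(max(stepped, xc - eps), xc + eps)
        # Keep the result a valid pixel value in [0, 1].
        updated.append(min(max(projected, 0.0), 1.0))
    return updated
```

Calling `pgd_step` repeatedly, with the gradient recomputed after each call, gives the progressive update described above; the projection keeps the accumulated perturbation within `eps` of the clean input no matter how many steps are taken.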

Abstract

Action recognition models using deep learning are vulnerable to adversarial examples, which transfer to other models trained on the same data modality. Existing transferable attack methods face two major challenges: 1) they rely heavily on the assumption that the decision boundaries of the surrogate (a.k.a. source) model and the target model are similar, which limits adversarial transferability; and 2) differences between those decision boundaries make the attack direction uncertain, which may cause gradient oscillation and weaken the attack. This motivates us to propose a Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition. From the input transformation perspective, we design a model-agnostic background adversarial mixup module to reduce the surrogate-target model dependency. In particular, we randomly sample one video from each category and build its background frame, then use reinforcement learning to select the background frame with the strongest attack ability for mixup with the clean frame. Moreover, to ensure an explicit attack direction, we leverage the background category as guidance for updating the gradient of the adversarial example, and design a temporal gradient consistency loss, which strengthens the stability of the attack direction on subsequent frames. Empirical studies on two video datasets, i.e., UCF101 and Kinetics-400, and one image dataset, i.e., ImageNet, demonstrate that our method significantly boosts the transferability of adversarial examples across several action/image recognition models. Our code is available at https://github.com/mlvccn/BMTC_TransferAttackVid.
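The two ingredients named in the abstract can be illustrated with a small sketch. It is not the paper's formulation: the abstract does not give the exact mixup rule or the form of the temporal gradient consistency loss, so the code below assumes a standard convex mixup with a hypothetical weight `lam`, and measures consistency as one minus the cosine similarity between gradients of consecutive frames. The reinforcement-learning selection of the background frame is not covered; the background frame is simply passed in.

```python
import math


def mixup_frame(clean, background, lam):
    """Blend a clean frame with a background frame, element-wise.

    Assumed rule: lam * clean + (1 - lam) * background, with lam in [0, 1].
    Frames are equal-length lists of floats (flattened pixels).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * c + (1.0 - lam) * b for c, b in zip(clean, background)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def temporal_consistency_penalty(frame_grads):
    """Mean of (1 - cosine) over gradients of consecutive frames.

    A stand-in for a temporal gradient consistency term: it is 0 when
    all adjacent frame gradients point the same way and grows as they
    diverge (up to 2 when they are opposite).
    """
    pairs = list(zip(frame_grads, frame_grads[1:]))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)
```

Under these assumptions, a penalty near zero means the per-frame attack directions agree across time, which is the stability property the abstract attributes to the temporal gradient consistency loss.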

Paper Structure

This paper contains 30 sections, 11 equations, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Illustration of the decision boundaries of the surrogate model and the target model.
  • Figure 2: Illustration of diverse and consistent temporal attack.
  • Figure 3: Overall framework of our Background Mixup-induced Temporal Consistency (BMTC) attack method for action recognition.
  • Figure 4: Examples of UCF101 (left) and Kinetics-400 (right).
  • Figure 5: Transfer attack with adversarial purification on UCF101.