Adversarial Robustness in RGB-Skeleton Action Recognition: Leveraging Attention Modality Reweighter

Chao Liu, Xin Liu, Zitong Yu, Yonghong Hou, Huanjing Yue, Jingyu Yang

TL;DR

This work investigates robustness in RGB-skeleton action recognition under adversarial perturbations and finds that the skeleton modality is more robust than RGB. It introduces the Attention-based Modality Reweighter (AMR), which learns to reweight modality-specific features via attention mechanisms and can be plugged into existing multimodal backbones; AMR also employs a novel loss with deep supervision through auxiliary predictions. The authors provide comprehensive experiments on NTU-RGB+D and iMiGUE, showing AMR achieves state-of-the-art robustness against attacks like PGD and CW and improves the balance of robustness across modalities (e.g., substantial gains in robust accuracy and RI). The results highlight the practical value of AMR for safer multimodal action recognition without requiring additional data, with implications for security-critical applications that rely on RGB-skeleton inputs.

Abstract

Deep neural networks (DNNs) have been applied to many computer vision tasks and have achieved state-of-the-art (SOTA) performance. However, DNNs misclassify adversarial examples, which are created by adding human-imperceptible adversarial noise to natural examples. This limits the application of DNNs in security-critical fields. To enhance model robustness, previous research has focused primarily on unimodal domains such as image recognition and video understanding. Although multimodal learning has achieved advanced performance in tasks such as action recognition, research on the robustness of RGB-skeleton action recognition models is scarce. In this paper, we systematically investigate how to improve the robustness of RGB-skeleton action recognition models. We first conduct an empirical analysis of the robustness of the different modalities and observe that the skeleton modality is more robust than the RGB modality. Motivated by this observation, we propose the Attention-based Modality Reweighter (AMR), which uses an attention layer to re-weight the two modalities, enabling the model to learn more robust features. AMR is plug-and-play, allowing easy integration with multimodal models. To demonstrate the effectiveness of AMR, we conduct extensive experiments on various datasets. For example, compared to SOTA methods, AMR achieves a 43.77% improvement against PGD20 attacks on the NTU-RGB+D 60 dataset. Furthermore, it effectively balances the differences in robustness between modalities.
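
To make the notion of an adversarial example concrete, the following is a minimal sketch of an L-infinity PGD attack with 20 iterations (how "PGD20" is read here). It is not the paper's evaluation code: the toy logistic model, the weights, the budget `eps = 8/255`, and the step size `alpha = 2/255` are all assumptions chosen for illustration, and the usual random start is omitted for brevity.

```python
import math

def logit(x, w, b):
    """Linear score of a toy logistic model (hypothetical stand-in for a DNN)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def loss_grad(x, y, w, b):
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(x, y, w, b, eps, alpha, steps):
    """Iteratively ascend the loss while staying inside the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        g = loss_grad(x_adv, y, w, b)
        # Signed-gradient ascent step on the loss.
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        # Keep the input in the valid range [0, 1].
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv

# Illustrative values only (assumptions, not taken from the paper).
x = [0.5, 0.5, 0.5]
w, b, y = [1.0, -2.0, 0.5], 0.0, 1
eps, alpha = 8 / 255, 2 / 255
x_adv = pgd_attack(x, y, w, b, eps, alpha, steps=20)
```

The perturbation never exceeds `eps` per coordinate, yet the score for the true class drops; against a deep network the same loop is run with gradients obtained by backpropagation.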

Paper Structure (20 sections, 14 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: The framework of the proposed AMR method. During training, natural and adversarial samples from both modalities are fed forward through the multimodal network in parallel. Features extracted from the adversarial samples are input to AMR, and their reweighted counterparts enter the respective subsequent layers of the network. These features are simultaneously used for classification to generate auxiliary predictions, denoted $\overset{*}{y}$, which contribute to the loss function. The natural samples go through regular forward propagation to yield the prediction results.
  • Figure 2: Adversarial robustness, i.e., accuracy under three types of adversarial attacks, on the iMiGUE dataset [Liu2021iMiGUEAI] and the NTU-RGB+D dataset [Shahroudy2016NTURA]. The x-axis indicates the attack strength $\varepsilon$ ($\times \frac{2}{255}$). FGSM-R, FGSM-S, and FGSM-M denote the FGSM attack on the RGB modality, the skeleton modality, and multiple modalities simultaneously, respectively; the other attacks are labeled analogously. 'Clean acc' is the accuracy after multimodal fusion on clean data. As the attack strength increases, the accuracy of the skeleton modality declines gradually and smoothly, in contrast to the sharp drop observed for both the RGB modality and the multimodal fusion. This confirms that the skeleton modality is more robust than the RGB modality.
  • Figure 3: Architecture of AMR for two modalities. The inputs to the module are $X_{R}^{\prime}$ and $X_{s}^{\prime}$, which represent the features at a given layer of the two unimodal networks. For better visualization, the spatiotemporal dimensions are represented on a single axis.
  • Figure 4: The distribution of weight matrices in different AMRs trained on two datasets. The $x$-axis represents the channels, and the $y$-axis represents the average weight values corresponding to each channel.
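
The captions for Figures 3 and 4 describe a module that takes features from both unimodal networks and produces per-channel weights. A rough sketch of that idea follows, assuming a squeeze-and-gate form of channel attention: each modality's features are averaged over the spatiotemporal axis, the two summaries are fused, and a sigmoid gate per modality rescales its channels. The layer sizes, the ReLU fusion layer, and all function names are hypothetical; the actual AMR layer may differ.

```python
import math
import random

def squeeze(feat):
    """Average each channel over its flattened spatiotemporal axis."""
    return [sum(ch) / len(ch) for ch in feat]

def linear(vec, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_linear(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def reweight(feat_rgb, feat_skel, params):
    """Return channel-reweighted features and the weights for both modalities."""
    # Joint descriptor built from both modalities' channel summaries.
    joint = squeeze(feat_rgb) + squeeze(feat_skel)
    hidden = [max(0.0, h) for h in linear(joint, *params["joint"])]
    # One sigmoid gate per channel, predicted separately for each modality.
    w_rgb = [sigmoid(z) for z in linear(hidden, *params["rgb"])]
    w_skel = [sigmoid(z) for z in linear(hidden, *params["skel"])]
    out_rgb = [[w * v for v in ch] for w, ch in zip(w_rgb, feat_rgb)]
    out_skel = [[w * v for v in ch] for w, ch in zip(w_skel, feat_skel)]
    return out_rgb, out_skel, w_rgb, w_skel

# Toy shapes (assumptions): 4 RGB channels, 3 skeleton channels, 5 positions.
rng = random.Random(0)
c_rgb, c_skel, n_pos, n_hidden = 4, 3, 5, 2
feat_rgb = [[rng.random() for _ in range(n_pos)] for _ in range(c_rgb)]
feat_skel = [[rng.random() for _ in range(n_pos)] for _ in range(c_skel)]
params = {
    "joint": init_linear(n_hidden, c_rgb + c_skel, rng),
    "rgb": init_linear(c_rgb, n_hidden, rng),
    "skel": init_linear(c_skel, n_hidden, rng),
}
out_rgb, out_skel, w_rgb, w_skel = reweight(feat_rgb, feat_skel, params)
```

Averaging `w_rgb` and `w_skel` over a dataset would give per-channel values of the kind Figure 4 plots, under the assumption that the plotted weights are such channel gates.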