Adversarial Attack for RGB-Event based Visual Object Tracking

Qiang Chen, Xiao Wang, Haowen Wang, Bo Jiang, Lin Zhu, Dawei Zhang, Yonghong Tian, Jin Tang

TL;DR

The paper studies the robustness of RGB-Event visual object tracking under adversarial perturbations. It introduces a cross-modal attack framework that targets both Event voxels and RGB-Event frames, combining gradient-guided optimization, a two-step voxel perturbation strategy, and cross-modal universal perturbations for frames with a temporal perturbation mechanism. The key contributions are (i) a voxel-based adversarial perturbation method that injects events into the target region and then refines them with gradients, (ii) a cross-modal fusion attack for RGB-Event voxel tracking, and (iii) a cross-modal frame attack that trains a universal perturbation across modalities. Experiments on COESOT, FE108, and VisEvent show significant degradation of tracking performance in both unimodal and multimodal settings, revealing vulnerabilities in RGB-Event multimodal trackers; the authors state that the source code will be released.

Abstract

Visual object tracking is a crucial research topic in the fields of computer vision and multi-modal fusion. Among various approaches, robust visual tracking that combines RGB frames with Event streams has attracted increasing attention from researchers. While striving for high accuracy and efficiency in tracking, it is also important to explore how to effectively conduct adversarial attacks and defenses on RGB-Event stream tracking algorithms, yet research in this area remains relatively scarce. To bridge this gap, we propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking. Event streams can be represented in diverse ways; because Event voxels and Event frames are the most commonly used, this paper focuses on these two representations. Specifically, for RGB-Event voxel based tracking, we first optimize the perturbation with an adversarial loss to generate RGB frame adversarial examples. For the discrete Event voxel representation, we propose a two-step attack strategy: we first inject Event voxels into the target region as initialized adversarial examples, then conduct a gradient-guided optimization that perturbs the spatial location of the Event voxels. For RGB-Event frame based tracking, we optimize a cross-modal universal perturbation by integrating the gradient information from multimodal data. We evaluate the proposed attack on three widely used RGB-Event tracking datasets, i.e., COESOT, FE108, and VisEvent. Extensive experiments show that our method significantly reduces the performance of the tracker across these datasets in both unimodal and multimodal scenarios. The source code will be released at https://github.com/Event-AHU/Adversarial_Attack_Defense
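
The two-step Event voxel attack described in the abstract (inject events into the target region, then shift their spatial locations along the gradient of an adversarial loss) can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on loudly stated assumptions: events are reduced to integer (x, y) pixels with no time bin or polarity, the "tracker" is a hypothetical stand-in that predicts the centroid of all events, and the adversarial loss is the squared distance between that prediction and the true center. All function names here are illustrative.

```python
import random


def toy_predict_center(events):
    """Hypothetical stand-in for a tracker: predicts the centroid of all events."""
    n = len(events)
    return (sum(x for x, _ in events) / n, sum(y for _, y in events) / n)


def adversarial_loss(events, gt_center):
    """Squared distance between the toy prediction and the true target center."""
    px, py = toy_predict_center(events)
    return (px - gt_center[0]) ** 2 + (py - gt_center[1]) ** 2


def inject_events(events, box, num_inject, rng):
    """Step 1: add events at random pixels inside the target box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    injected = [(rng.randint(x0, x1), rng.randint(y0, y1)) for _ in range(num_inject)]
    # Also return the index at which the injected events begin.
    return events + injected, len(events)


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def refine_locations(events, start, gt_center, size, steps):
    """Step 2: move injected events one pixel per step along the sign of the loss gradient."""
    width, height = size
    events = list(events)
    for _ in range(steps):
        px, py = toy_predict_center(events)
        # For this toy loss, the gradient w.r.t. each event coordinate is
        # 2 * (pred - gt) / n. Only its sign is used, so the factor is dropped.
        gx, gy = sign(px - gt_center[0]), sign(py - gt_center[1])
        for i in range(start, len(events)):
            x, y = events[i]
            # Clamp so perturbed events stay inside the sensor frame.
            events[i] = (min(max(x + gx, 0), width - 1), min(max(y + gy, 0), height - 1))
    return events


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [(30, 30), (28, 32), (32, 28)]  # toy events around the target
    gt_center, box, size = (30, 30), (20, 20, 40, 40), (64, 64)
    injected, start = inject_events(clean, box, 5, rng)
    adv = refine_locations(injected, start, gt_center, size, steps=10)
    print("loss after injection: ", adversarial_loss(injected, gt_center))
    print("loss after refinement:", adversarial_loss(adv, gt_center))
```

Only the injected events are moved, so the original events stay untouched, and the integer sign step keeps every perturbed event on the pixel grid. A real attack would replace the toy centroid predictor with gradients backpropagated through the tracker itself.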

Paper Structure

This paper contains 23 sections, 7 equations, 11 figures, 4 tables, 5 algorithms.

Figures (11)

  • Figure 1: An illustration of adversarial attacks for RGB-Event visual object tracking.
  • Figure 2: An overview of our proposed multi-modal adversarial attack framework for the RGB-Voxel based tracking.
  • Figure 3: An overview of our proposed multi-modal adversarial attack framework for the RGB-Event Frame based tracking.
  • Figure 4: Validation of Cross-Modal Attack Fusion for RGB-Event Voxel VOT with temporal perturbation on COESOT.
  • Figure 5: Validation of Cross-Modal Attack Fusion for RGB-Event Voxel VOT's sign function on COESOT.
  • ...and 6 more figures