Reproducibility Study on Adversarial Attacks Against Robust Transformer Trackers

Fatemeh Nourilenjan Nokabadi, Jean-François Lalonde, Christian Gagné

TL;DR

This study examines the reproducibility of adversarial attacks on transformer-based visual trackers across three benchmarks (VOT2022ST, UAV123, GOT10k) and two output types (bounding boxes and binary masks). It applies four attacks (CSA, IoU, SPARK, RTAA), covering white-box and black-box settings, to both transformer and non-transformer trackers. The results show that binary masks are generally more susceptible than bounding boxes and that white-box attacks are more potent on transformer outputs. Deeper transformer backbones with cross-attention (e.g., MixFormer variants, ROMTrack) can exhibit greater inherent robustness, and existing attacks do not fully break these models, which points to a need for new attack methods tailored to modern trackers. The results offer practical guidance for designing more robust transformer trackers and underline the importance of stronger white-box and black-box adversarial techniques in tracking. The code needed to reproduce the study is also provided to support benchmarking and further research on adversarial robustness in tracking.

Abstract

New transformer networks have been integrated into object tracking pipelines and have demonstrated strong performance on the latest benchmarks. This paper focuses on understanding how transformer trackers behave under adversarial attacks and how different attacks perform on tracking datasets as their parameters change. We conducted a series of experiments to evaluate the effectiveness of existing adversarial attacks on object trackers with transformer and non-transformer backbones. We experimented on 7 different trackers: 3 that are transformer-based and 4 that leverage other architectures. These trackers are tested against 4 recent attack methods to assess their performance and robustness on the VOT2022ST, UAV123 and GOT10k datasets. Our empirical study focuses on evaluating the adversarial robustness of object trackers based on bounding box versus binary mask predictions, and on attack methods at different perturbation levels. Interestingly, our study found that altering the perturbation level may not significantly affect the overall object tracking results after the attack. Similarly, the sparsity and imperceptibility of the attack perturbations may remain stable as the perturbation level shifts. By applying a specific attack to all transformer trackers, we show that new transformer trackers with stronger cross-attention modeling achieve greater adversarial robustness on tracking datasets such as VOT2022ST and GOT10k. Our results also indicate the need for new attack methods that can effectively tackle the latest types of transformer trackers. The code necessary to reproduce this study is available at https://github.com/fatemehN/ReproducibilityStudy.

Paper Structure

This paper contains 23 sections, 13 equations, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Mask vs. bounding box predictions output by the transformer trackers MixFormerM cui_mixformer_2022 and TransT-SEG chen_high-performance_2023 while adversarial attacks perturb the input frame/search region. The outputs of TransT-SEG are harmed more by the white-box methods, SPARK guo_spark_2020 and RTAA jia_robust_2020, than by the black-box attacks, IoU jia_iou_2021 and CSA yan_cooling-shrinking_2020. The green mask/bounding box shows the tracker's original prediction, and the red mask/bounding box shows its prediction after each attack.
  • Figure 2: Precision and success plots of TransT chen_transformer_2021 after the RTAA jia_robust_2020 (a, b) and SPARK guo_spark_2020 (c, d) attacks under different noise levels on the UAV123 mueller_benchmark_2016 dataset. The average score for each metric is shown in the plot legends. The red curve is the original TransT performance with no attack applied. The $e$ values correspond to the $\epsilon$ values in our experiments, ranging from $e_1 = 2.55$ to $e_5 = 40.8$, and are used to assess TransT performance after the white-box attacks at various perturbation levels. SPARK's performance did not change across perturbation levels on the UAV123 dataset, as the overlapping SPARK curves show.
  • Figure 3: Search regions from the "bubble" sequence in the VOT2022ST dataset kristan_tenth_2023 after applying the SPARK guo_spark_2020 attack to the TransT chen_transformer_2021 tracker. Each perturbed search region is labeled with the SSIM wang_image_2004 measured between the search regions before and after the attack. The perturbation maps, following the work of yan_cooling-shrinking_2020, show the added noise in color. The L1 norm of each perturbation map is calculated to show the perturbation density (i.e., sparsity).
  • Figure 4: Search regions from the "bubble" sequence in the VOT2022ST dataset kristan_tenth_2023 after applying the RTAA jia_robust_2020 attack to the TransT chen_transformer_2021 tracker. Each perturbed search region is labeled with the SSIM wang_image_2004 measured between the search regions before and after the attack. The perturbation maps, following the work of yan_cooling-shrinking_2020, show the added noise in color. The L1 norm of each perturbation map is calculated to show the perturbation density (i.e., sparsity).
  • Figure 5: Perturbed frames and perturbation maps generated by the IoU method jia_iou_2021 against ROMTrack cai_robust_2023 using three upper bounds, $\zeta \in \{8k, 10k, 12k\}$. Each frame is labeled with the imperceptibility and the L1 norm of the generated perturbation, which indicate how visible the noise is and how sparse the perturbation map is.
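
The captions for Figures 2 to 5 refer to a perturbation level $\epsilon$ and to the L1 norm of a perturbation map. The following is a minimal sketch of those two quantities, not the code from the study's repository. It assumes that a perturbation map is the per-pixel difference between the perturbed and the clean search region, that $\epsilon$ acts as a per-pixel bound on a 0 to 255 intensity scale, and that images are single-channel nested lists. The function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: helper names and the interpretation of epsilon as a
# per-pixel bound on a 0-255 scale are assumptions, not taken from the paper's code.

def perturbation_map(clean, perturbed):
    """Per-pixel difference (perturbed - clean) for two same-shaped 2D images."""
    return [
        [p - c for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(clean, perturbed)
    ]


def l1_norm(pmap):
    """Sum of absolute perturbation values; smaller means a sparser/lighter map."""
    return sum(abs(v) for row in pmap for v in row)


def linf_norm(pmap):
    """Largest absolute per-pixel change in the perturbation map."""
    return max(abs(v) for row in pmap for v in row)


def clip_to_budget(clean, perturbed, eps, lo=0.0, hi=255.0):
    """Keep every pixel within eps of the clean value and inside [lo, hi]."""
    clipped = []
    for row_c, row_p in zip(clean, perturbed):
        new_row = []
        for c, p in zip(row_c, row_p):
            p = min(max(p, c - eps), c + eps)  # enforce the epsilon bound
            p = min(max(p, lo), hi)            # keep a valid pixel intensity
            new_row.append(p)
        clipped.append(new_row)
    return clipped


# Toy 2x2 "search region" and an unconstrained perturbed version of it.
clean = [[100.0, 200.0], [50.0, 250.0]]
raw_attack = [[110.0, 195.0], [50.0, 260.0]]

# The two endpoint levels quoted in the Figure 2 caption.
for eps in (2.55, 40.8):
    bounded = clip_to_budget(clean, raw_attack, eps)
    pmap = perturbation_map(clean, bounded)
    print(eps, l1_norm(pmap), linf_norm(pmap))
```

In this toy case the larger budget leaves the raw perturbation almost untouched, so raising $\epsilon$ further would not change the result. This is one simple way a reported metric can stay flat across perturbation levels, although it is not necessarily the explanation for the overlapping SPARK curves in Figure 2.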