LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks

Jianlang Chen, Xuhong Ren, Qing Guo, Felix Juefei-Xu, Di Lin, Wei Feng, Lei Ma, Jianjun Zhao

TL;DR

This work tackles adversarial perturbations in visual object tracking by introducing the Language-Driven Resamplable Continuous Representation (LRR), which combines a Spatial-Temporal Implicit Representation (STIR) with a Language-Driven ResampleNet (LResampleNet). STIR reconstructs frame pixels at continuous spatial-temporal coordinates using information from neighboring frames, while LResampleNet uses CLIP-based text guidance derived from the object template to resample the frame so that it stays semantically consistent with the template. Trained on large-scale video datasets with adversarial perturbations, LRR defends against multiple SOTA attacks across diverse trackers and datasets, often restoring or surpassing clean-data accuracy, and generalizes to transformer-based trackers such as ToMP-50. The method runs online at ~25–29 fps and transfers across trackers and attacks, making it a practical preprocessing defense for real-world tracking systems where semantic consistency with the template is crucial.
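
The idea of querying a pixel value at a continuous spatial-temporal coordinate can be illustrated with a small sketch. This is not the paper's STIR, which is a learned implicit representation; as a stand-in it uses plain trilinear interpolation over a short stack of grayscale frames. The function names `sample_continuous` and `resample_frame`, and the uniform `(dx, dy)` shift standing in for predicted resampling offsets, are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence

Frame = List[List[float]]  # one grayscale frame: rows of pixel intensities


def sample_continuous(frames: Sequence[Frame], x: float, y: float, t: float) -> float:
    """Trilinearly interpolate the intensity at a continuous (x, y, t).

    Hypothetical helper: a hand-written stand-in for a learned implicit
    representation, which would predict the value with a network instead.
    """
    num_t, height, width = len(frames), len(frames[0]), len(frames[0][0])

    # Clamp the query so it stays inside the frame stack.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    t = min(max(t, 0.0), num_t - 1.0)

    # Integer neighbors on each axis and the fractional position between them.
    x0, y0, t0 = math.floor(x), math.floor(y), math.floor(t)
    x1 = min(x0 + 1, width - 1)
    y1 = min(y0 + 1, height - 1)
    t1 = min(t0 + 1, num_t - 1)
    fx, fy, ft = x - x0, y - y0, t - t0

    def bilinear(frame: Frame) -> float:
        top = frame[y0][x0] * (1 - fx) + frame[y0][x1] * fx
        bottom = frame[y1][x0] * (1 - fx) + frame[y1][x1] * fx
        return top * (1 - fy) + bottom * fy

    # Blend the two temporally neighboring frames.
    return bilinear(frames[t0]) * (1 - ft) + bilinear(frames[t1]) * ft


def resample_frame(frames: Sequence[Frame], t: float,
                   dx: float = 0.0, dy: float = 0.0) -> Frame:
    """Rebuild a whole frame at time t, sampling pixel (x, y) at (x+dx, y+dy).

    The uniform shift (dx, dy) is an illustrative assumption; a resampling
    network would supply its own per-pixel sampling positions.
    """
    height, width = len(frames[0]), len(frames[0][0])
    return [
        [sample_continuous(frames, col + dx, row + dy, t) for col in range(width)]
        for row in range(height)
    ]


if __name__ == "__main__":
    stack = [
        [[0.0, 1.0], [2.0, 3.0]],      # frame at t = 0
        [[10.0, 11.0], [12.0, 13.0]],  # frame at t = 1
    ]
    print(sample_continuous(stack, x=0.5, y=0.5, t=0.5))
    print(resample_frame(stack, t=0.5))
```

In this toy setting a query between two frames simply averages them, which already shows why reconstructing a frame from its temporal neighbors can dilute a perturbation that differs from frame to frame; the learned components described above replace both the interpolation rule and the choice of sampling positions.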

Abstract

Visual object tracking plays a critical role in visual-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress made in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation using the semantic text guidance of the object of interest. This novel continuous representation enables us to reconstruct incoming frames to maintain semantic and appearance consistency with the object of interest and its clean counterparts. As a result, our proposed method successfully defends against different SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method significantly increases tracking accuracy under adversarial attacks with around 90% relative improvement on UAV123, which is even higher than the accuracy on clean data.

Paper Structure (22 sections, 6 equations, 9 figures, 13 tables)

Figures (9)

  • Figure 1: (a) shows the main idea of this work: we propose the language-driven resamplable continuous representation (LRR) that takes the template's text term and historical frames as inputs to reconstruct the incoming frame. (b) shows the results on VOT2019 with and without LRR under clean data and different attacks.
  • Figure 2: Pipeline of the proposed language-driven resamplable continuous representation (LRR), which contains two key parts: the spatial-temporal implicit representation (STIR) and the language-driven ResampleNet (LResampleNet). STIR takes continuous spatial and temporal coordinates as inputs (see the center point of the blue rectangle) and estimates the corresponding color value.
  • Figure 3: Visual comparison before and after LRR defense for SiamRPN++ under the CSA attack.
  • Figure 4: Visual comparison before and after defense with DISCO, STIR, and LRR.
  • Figure 5: Visual comparison of ResampleNet with and without language guidance when the input frame contains the object of interest.
  • ...and 4 more figures