LRR: Language-Driven Resamplable Continuous Representation against Adversarial Tracking Attacks
Jianlang Chen, Xuhong Ren, Qing Guo, Felix Juefei-Xu, Di Lin, Wei Feng, Lei Ma, Jianjun Zhao
TL;DR
This work tackles adversarial perturbations in visual object tracking by introducing Language-Driven Resamplable Continuous Representation (LRR), which combines a Spatial-Temporal Implicit Representation (STIR) with a Language-Driven ResampleNet (LResampleNet). STIR reconstructs frame pixels at continuous spatial-temporal coordinates using information from neighboring frames, while LResampleNet uses CLIP-based text guidance derived from the object template to resample frames so that they stay semantically consistent with the tracked object. Trained on large-scale video datasets with adversarial perturbations, LRR defends against multiple state-of-the-art (SOTA) attacks across diverse trackers and datasets, often restoring or surpassing clean-data accuracy, and generalizes to transformer-based trackers such as ToMP-50. The method runs online at ~25–29 fps and transfers well across attacks and trackers, making it a practical preprocessing defense for real-world tracking systems where semantic consistency with the template is crucial.
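To make the idea of reading a pixel at a continuous spatial-temporal coordinate concrete, here is a minimal sketch that samples a small video volume at a fractional (x, y, t) location by blending the surrounding pixels of neighboring frames. It uses plain trilinear interpolation as a stand-in and is not the learned STIR network; the function name `sample_continuous` and the grayscale `(T, H, W)` layout are assumptions made for illustration only.

```python
import numpy as np


def sample_continuous(frames, x, y, t):
    """Trilinearly interpolate a grayscale video volume at a continuous (x, y, t).

    frames: array of shape (T, H, W); x, y, t are floats in pixel/frame units.
    Plain interpolation is used here as a hypothetical stand-in for a learned
    implicit representation; it is not the STIR network from the paper.
    """
    frames = np.asarray(frames, dtype=np.float64)
    T, H, W = frames.shape

    # Clamp the query so that it stays inside the volume.
    t = float(np.clip(t, 0, T - 1))
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))

    # Integer corners surrounding the query point.
    t0, y0, x0 = int(np.floor(t)), int(np.floor(y)), int(np.floor(x))
    t1, y1, x1 = min(t0 + 1, T - 1), min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dt, dy, dx = t - t0, y - y0, x - x0

    # Blend the eight neighbors, weighting each by its proximity to the query.
    value = 0.0
    for ti, wt in ((t0, 1.0 - dt), (t1, dt)):
        for yi, wy in ((y0, 1.0 - dy), (y1, dy)):
            for xi, wx in ((x0, 1.0 - dx), (x1, dx)):
                value += wt * wy * wx * frames[ti, yi, xi]
    return value


if __name__ == "__main__":
    # Two 2x2 frames; query a point halfway between the frames in space and time.
    video = np.arange(8).reshape(2, 2, 2)
    print(sample_continuous(video, x=0.5, y=0.5, t=0.5))
```

A learned representation replaces the fixed interpolation weights with a network conditioned on features of the neighboring frames, which is what would let it suppress perturbations rather than merely blend them.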
Abstract
Visual object tracking plays a critical role in vision-based autonomous systems, as it aims to estimate the position and size of the object of interest within a live video. Despite significant progress in this field, state-of-the-art (SOTA) trackers often fail when faced with adversarial perturbations in the incoming frames. This can lead to significant robustness and security issues when these trackers are deployed in the real world. To achieve high accuracy on both clean and adversarial data, we propose building a spatial-temporal continuous representation using the semantic text guidance of the object of interest. This novel continuous representation enables us to reconstruct incoming frames so that they maintain semantic and appearance consistency with the object of interest and with their clean counterparts. As a result, our proposed method successfully defends against different SOTA adversarial tracking attacks while maintaining high accuracy on clean data. In particular, our method significantly increases tracking accuracy under adversarial attacks, with around 90% relative improvement on UAV123, yielding accuracy that is even higher than the accuracy on clean data.
