UTPTrack: Towards Simple and Unified Token Pruning for Visual Tracking

Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin, Yunpu Ma, Xiaoyu Shen

TL;DR

Extensive evaluations demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively.

Abstract

One-stream Transformer-based trackers achieve advanced performance in visual object tracking but suffer from significant computational overhead that hinders real-time deployment. While token pruning offers a path to efficiency, existing methods are fragmented. They typically prune the search region, dynamic template, and static template in isolation, overlooking critical inter-component dependencies, which yields suboptimal pruning and degraded accuracy. To address this, we introduce UTPTrack, a simple and Unified Token Pruning framework that, for the first time, jointly compresses all three components. UTPTrack employs an attention-guided, token type-aware strategy to holistically model redundancy, a design that seamlessly supports unified tracking across multimodal and language-guided tasks within a single model. Extensive evaluations on 10 benchmarks demonstrate that UTPTrack achieves a new state-of-the-art in the accuracy-efficiency trade-off for pruning-based trackers, pruning 65.4% of vision tokens in RGB-based tracking and 67.5% in unified tracking while preserving 99.7% and 100.5% of baseline performance, respectively. This strong performance across both RGB and multimodal scenarios underlines its potential as a robust foundation for future research in efficient visual tracking. Code will be released at https://github.com/EIT-NLP/UTPTrack.
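To make the idea of attention-guided, token type-aware pruning concrete, the following is a minimal sketch and not the authors' implementation. It assumes each vision token already carries a scalar importance score (for example, attention received from target tokens; how UTPTrack computes this is not described in this section) and a component label: SR, DT, or ST. The function name `prune_tokens`, the per-component top-k rule, and the example scores are all hypothetical.

```python
import math


def prune_tokens(scores, token_types, keep_ratio):
    """Keep the highest-scoring tokens within each component.

    scores      -- one importance score per token (assumed given)
    token_types -- one label per token, e.g. "SR", "DT", "ST"
    keep_ratio  -- dict mapping a label to its retention ratio r in (0, 1];
                   labels missing from the dict are kept in full
    Returns the sorted indices of the tokens that survive pruning.
    """
    if len(scores) != len(token_types):
        raise ValueError("scores and token_types must have the same length")

    # Group token indices by component so each one is pruned with its own ratio.
    groups = {}
    for i, label in enumerate(token_types):
        groups.setdefault(label, []).append(i)

    kept = []
    for label, indices in groups.items():
        r = keep_ratio.get(label, 1.0)
        # Always keep at least one token per component (an assumption of this sketch).
        k = max(1, math.ceil(r * len(indices)))
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:k])

    # Restore the original token order for the next encoder layer.
    return sorted(kept)


# Hypothetical example: 4 search-region, 2 dynamic-template, 2 static-template tokens.
scores = [0.9, 0.1, 0.4, 0.8, 0.3, 0.7, 0.2, 0.6]
token_types = ["SR"] * 4 + ["DT"] * 2 + ["ST"] * 2
kept = prune_tokens(scores, token_types, {"SR": 0.5, "DT": 0.5, "ST": 0.5})
print(kept)  # [0, 3, 5, 7]
```

Pruning all three components in one pass, each with its own ratio, is what distinguishes this from pruning the search region alone. A real tracker would apply the selection to token embeddings inside the encoder rather than to index lists.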

Paper Structure (60 sections, 10 equations, 9 figures, 17 tables)

Figures (9)

  • Figure 1: UTPTrack supports RGB-based and unified tracking and prunes redundant tokens in the search region (SR), dynamic template (DT), and static template (ST) to improve efficiency; $r$ denotes the token retention ratio for each component.
  • Figure 2: Architecture of the proposed UTPTrack. UTPTrack supports both RGB-based and unified tracking. It adopts a one-stream transformer that jointly processes tokens from the search region (SR), dynamic template (DT), and static template (ST). A lightweight Candidate or Template Elimination Module (CTEM) is inserted into encoder layers to prune redundant tokens from all three sources. In the figure, D/T/E denote depth, thermal, and event modalities, respectively.
  • Figure 3: Performance comparison of UTPTrack and other pruning methods under each method's default compression settings at two resolutions. Top (High Resolution): 384 (RGB and Unified). Bottom (Low Resolution): 256 (RGB) and 224 (Unified).
  • Figure 4: Ablation Study on Progressive Pruning. Performance and the number of vision tokens are reported as the keep ratio decreases and CE, DTE, and STE are progressively enabled for the RGB-based tracker (top) and unified tracker (bottom).
  • Figure 5: Visualization of the UTPTrack Pruning Process for RGB-based Tracking.
  • ...and 4 more figures