Context-Aware Token Pruning and Discriminative Selective Attention for Transformer Tracking

Janani Kugarajeevan, Thanikasalam Kokul, Amirthalingam Ramanan, Subha Fernando

TL;DR

This paper tackles background interference and computational inefficiency in one-stream Transformer tracking. It introduces CPDATrack, which combines a learnable Target Probability Estimation (TPE) with Context-Aware Token Pruning (CATP) and a Discriminative Selective Attention (DSA) mechanism to suppress background and distractor tokens while preserving crucial contextual cues. The approach yields state-of-the-art results on GOT-10k (AO 75.1%) and strong performance on TrackingNet, UAV123, and LaSOT, while maintaining real-time speeds around 43 FPS. This work demonstrates that targeted token pruning guided by learnable target likelihoods, coupled with attentive focus within a spatial zone, can substantially improve both tracking robustness and efficiency in Transformer-based visual object tracking.

Abstract

One-stream Transformer-based trackers have demonstrated remarkable performance by concatenating template and search region tokens, thereby enabling joint attention across all tokens. However, enabling an excessive proportion of background search tokens to attend to the target template tokens weakens the tracker's discriminative capability. Several token pruning methods have been proposed to mitigate background interference; however, they often remove tokens near the target, leading to the loss of essential contextual information and degraded tracking performance. Moreover, the presence of distractors within the search tokens further reduces the tracker's ability to accurately identify the target. To address these limitations, we propose CPDATrack, a novel tracking framework designed to suppress interference from background and distractor tokens while enhancing computational efficiency. First, a learnable module is integrated between two designated encoder layers to estimate the probability of each search token being associated with the target. Based on these estimates, less-informative background tokens are pruned from the search region while preserving the contextual cues surrounding the target. To further suppress background interference, a discriminative selective attention mechanism is employed that fully blocks search-to-template attention in the early layers. In the subsequent encoder layers, high-probability target tokens are selectively extracted from a localized region to attend to the template tokens, thereby reducing the influence of background and distractor tokens. The proposed CPDATrack achieves state-of-the-art performance across multiple benchmarks, particularly on GOT-10k, where it attains an average overlap of 75.1 percent.
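The two-phase attention rule described in the abstract can be illustrated with a boolean attention mask. The following is a minimal sketch, not the authors' implementation. It assumes that tokens are ordered as [template, search], that "search-to-template" means information flowing from search tokens into template tokens (template queries reading search keys), and that all other attention paths stay open. The summary does not spell out any of these conventions, and the names `dsa_mask` and `masked_softmax` are hypothetical.

```python
import math
from typing import Iterable, List, Optional


def dsa_mask(n_template: int, n_search: int,
             target_search_idx: Optional[Iterable[int]] = None) -> List[List[bool]]:
    """Return allow[q][k]: True if query token q may attend to key token k.

    Assumed token order: template tokens first, then search tokens.
    target_search_idx=None models the early layers (template queries see no
    search keys); a list of search indices models the later layers, where
    only the selected target tokens are visible to the template queries.
    """
    n = n_template + n_search
    allow = [[True] * n for _ in range(n)]
    permitted = set() if target_search_idx is None else set(target_search_idx)
    for q in range(n_template):          # template queries only
        for s in range(n_search):        # every search key
            allow[q][n_template + s] = s in permitted
    return allow


def masked_softmax(logits: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over one query row; blocked keys receive exactly zero weight."""
    scores = [x if ok else float("-inf") for x, ok in zip(logits, allowed)]
    peak = max(scores)                   # finite: a template token can always see itself
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


early = dsa_mask(n_template=2, n_search=3)                        # early layers
late = dsa_mask(n_template=2, n_search=3, target_search_idx=[1])  # later layers
```

Under these assumptions, a blocked search token contributes nothing to the template representation regardless of how large its raw attention logit is, which is the effect the abstract attributes to the early-layer blocking.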

Paper Structure

This paper contains 23 sections, 6 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Comparison of token pruning strategies: (a) the actual search region; (b) token elimination in OSTrack [ye2022joint] using conventional background pruning, which may inadvertently discard essential contextual tokens; and (c) the proposed context-aware token pruning method, which preserves a local contextual zone around the target while removing distant background tokens.
  • Figure 2: The overall architecture of the proposed CPDATrack approach. The Target Probability Estimation (TPE) module predicts the target likelihood of each search token based on the target template and dynamic template tokens. Guided by these predictions, the Context-Aware Token Pruning (CATP) module removes less informative background tokens in the search region while preserving essential contextual information. The Discriminative Selective Attention (DSA) mechanism further reduces the interference of background tokens on the target and dynamic template tokens.
  • Figure 3: Proposed Target Probability Estimation (TPE) module. True target representations from the initial and dynamic templates are concatenated with each search token to predict the likelihood of it belonging to the target.
  • Figure 4: The proposed Context-Aware Token Pruning (CATP) module. The target probabilities of the search region tokens are first aggregated using a sliding $3 \times 3$ window. Subsequently, a contextual zone (CZ) is defined, centered on the token with the highest aggregated target probability, and a fixed number of background tokens outside the CZ are pruned.
  • Figure 5: Overview of the proposed Discriminative Selective Attention (DSA) mechanism. In the early encoder layers, search-to-template cross-attention is suppressed to prevent background interference. After token pruning, search tokens are first divided into target and background tokens. Subsequently, target tokens are further classified into actual target tokens and distractor tokens. In the remaining layers, only the actual target tokens are permitted to perform cross-attention with both the initial and dynamic template tokens.
  • ...and 6 more figures
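
The pruning steps in the Figure 4 caption (3 × 3 aggregation, contextual zone around the peak, fixed-size pruning outside the zone) can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: the aggregation is taken to be a zero-padded sum, the contextual zone is taken to be a square of half-width `cz_radius`, and the tokens pruned outside the zone are taken to be those with the lowest individual target probability. The caption fixes none of these details, and the names `aggregate_3x3`, `context_aware_prune`, `cz_radius`, and `num_prune` are hypothetical.

```python
from typing import List, Tuple


def aggregate_3x3(probs: List[List[float]]) -> List[List[float]]:
    """Sum each token's probability over its 3x3 neighbourhood.

    Assumption: neighbours that fall outside the grid count as zero.
    """
    h, w = len(probs), len(probs[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            out[r][c] = sum(
                probs[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
            )
    return out


def context_aware_prune(probs: List[List[float]], cz_radius: int,
                        num_prune: int) -> Tuple[List[int], Tuple[int, int]]:
    """Return (kept flat token indices in row-major order, contextual-zone centre)."""
    h, w = len(probs), len(probs[0])
    agg = aggregate_3x3(probs)

    # Centre of the contextual zone: highest aggregated target probability.
    cr, cc = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: agg[rc[0]][rc[1]],
    )

    # Tokens outside the (assumed square) contextual zone are pruning candidates.
    outside = [
        (r, c) for r in range(h) for c in range(w)
        if abs(r - cr) > cz_radius or abs(c - cc) > cz_radius
    ]

    # Assumption: prune the candidates with the lowest target probability first.
    outside.sort(key=lambda rc: probs[rc[0]][rc[1]])
    pruned = set(outside[:num_prune])

    kept = [r * w + c for r in range(h) for c in range(w) if (r, c) not in pruned]
    return kept, (cr, cc)
```

Because candidates are drawn only from outside the zone, every token inside it survives even when its own probability is low, which corresponds to the contextual cues the caption says are preserved. The number of kept tokens is fixed at `h * w - num_prune` whenever enough candidates exist outside the zone.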