Correlation-Embedded Transformer Tracking: A Single-Branch Framework

Fei Xie, Wankou Yang, Chunyu Wang, Lei Chu, Yue Cao, Chao Ma, Wenjun Zeng

TL;DR

This work rethinks visual object tracking by replacing the conventional two-branch Siamese pipeline with a fully transformer-based Single-Branch Transformer (SBT) that embeds cross-image correlation throughout the feature network. By unifying feature extraction and correlation in a single stream, SBT achieves strong target–distractor discrimination while maintaining coherence across dissimilar targets. The authors then develop an improved variant, SuperSBT, featuring a hierarchical three-stage backbone, a local modeling layer, a unified relation modeling layer, relative position encoding, Masked Image Modeling pre-training, temporal modeling, and a Mix-MLP prediction head. SuperSBT yields state-of-the-art results on eight VOT benchmarks while raising speed from 37 FPS (SBT) to 81 FPS. The approach simplifies the tracking architecture, improves efficiency, and offers a strong, scalable baseline for future transformer-based visual tracking research.

Abstract

Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by the Siamese-like networks are often insufficient to model the tracked targets and distractor objects, which prevents these trackers from being robust and discriminative at the same time. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike Siamese-like feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images through multiple layers, it can suppress non-target features, resulting in target-aware feature extraction. The output features can be directly used for predicting target locations without additional correlation steps. Thus, we reformulate two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After conducting an in-depth analysis of the SBT baseline, we summarize many effective design principles and propose an improved tracker dubbed SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer to enhance shallow-level features. A unified relation modeling layer is proposed to remove complex handcrafted layer pattern designs. SuperSBT is further improved by masked image modeling pre-training, temporal modeling, and dedicated prediction heads. As a result, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC on LaSOT, TrackingNet, and GOT-10k, respectively. Notably, SuperSBT raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight VOT benchmarks.

Paper Structure

This paper contains 35 sections, 15 equations, 12 figures, and 8 tables.

Figures (12)

  • Figure 1: Comparison of state-of-the-art trackers on GOT-10k. We visualize the AO performance with respect to the model size and running speed. All reported results follow the official GOT-10k test protocol. Our SBT and SuperSBT variants achieve superior results with high speed.
  • Figure 2: (a1) Standard Siamese-like feature extraction. (b1) Our single-branch framework uses joint feature extraction and correlation. Our pipeline removes separate correlation steps, e.g., Siamese cropping correlation (SiamRPN), DCF (ATOM), and transformer-based correlation (TransT). (a2)/(b2) are the t-SNE visualizations of search features in (a1)/(b1) as the feature networks go deeper.
  • Figure 3: (a) Architecture of our proposed Single-Branch Transformer framework for tracking (SBT). Unlike Siamese, DCF, and Transformer-based methods, it has no standalone module for computing correlation. Instead, it embeds correlation in all Feature Relation Modeling (FRM) layers at different network levels. The fully fused features of the search image are directly fed to the prediction network to obtain the location and size of the target. (b) Structure of the FRM layer, which is a variant of the transformer (ViT) layer. There are two options for the attention operator in the FRM layer: Self-Attention (SA) and Cross-Attention (CA). The SA operator fuses features within the same image, while the CA operator mixes features across images.
  • Figure 4: Studies on the number and position of FRM-CA blocks. (a) Different model settings, (b) Speed vs. different model settings, (c) Tracking performance vs. position of the earliest FRM-CA layer, (d) Tracking performance vs. number of FRM-CA layers, (e) Tracking performance vs. pre-trained or not, (f) Tracking performance vs. interval between FRM-CA layers.
  • Figure 5: Architecture of our improved Single Branch Transformer framework for tracking (SuperSBT). Based on the summarized design principles, we upgrade the SBT baseline with a local modeling layer, unified relation modeling, and reasonable architecture variants.
  • ...and 7 more figures
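
The two attention operators named in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes single-head attention, weights shared by the template and search images, a bidirectional CA operator, and no layer normalization, MLP, or position encoding. The names `attention`, `frm_layer`, `run_single_branch`, the token counts, and the SA/CA layer pattern are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(query_tokens, context_tokens, w_q, w_k, w_v):
    # Scaled dot-product attention: queries come from one token set,
    # keys and values from another (possibly the same) token set.
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def frm_layer(template, search, weights, operator):
    # Simplified FRM layer (assumption: residual attention only).
    if operator == "SA":
        # Self-attention: each image attends to its own tokens.
        t_out = template + attention(template, template, *weights)
        s_out = search + attention(search, search, *weights)
    elif operator == "CA":
        # Cross-attention: each image attends to the other image's tokens
        # (assumed bidirectional here).
        t_out = template + attention(template, search, *weights)
        s_out = search + attention(search, template, *weights)
    else:
        raise ValueError(f"unknown operator: {operator}")
    return t_out, s_out


def run_single_branch(template, search, weights, pattern):
    # Apply a stack of FRM layers; the fused search features would then go
    # straight to a prediction head, with no separate correlation step.
    for operator in pattern:
        template, search = frm_layer(template, search, weights, operator)
    return template, search


rng = np.random.default_rng(0)
dim = 16
template = rng.normal(size=(64, dim))   # e.g. 8x8 template tokens (hypothetical)
search = rng.normal(size=(256, dim))    # e.g. 16x16 search tokens (hypothetical)
weights = tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

pattern = ["SA", "CA", "SA", "CA"]      # hypothetical layer pattern
t, s = run_single_branch(template, search, weights, pattern)
print(s.shape)  # (256, 16)
```

In this sketch, the search features after an SA layer depend only on the search image, while after a CA layer they also depend on the template. That dependency is what makes the extracted features target-aware.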