MixFormerV2: Efficient Fully Transformer Tracking

Yutao Cui, Tianhui Song, Gangshan Wu, Limin Wang

TL;DR

This work tackles the efficiency bottlenecks of Transformer-based visual tracking by introducing MixFormerV2, a fully transformer tracker that eliminates dense convolutional heads and complex score modules. It leverages four learnable prediction tokens concatenated with target and search tokens, enabling regression and confidence scoring through lightweight MLP heads within a unified backbone. A distillation-based model reduction, combining dense-to-sparse and deep-to-shallow strategies (with an intermediate teacher and MLP reduction), further improves efficiency on GPU and CPU. Across benchmarks, MixFormerV2 achieves strong accuracy with real-time performance, notably LaSOT 70.6% AUC at 165 FPS on GPU and CPU-real-time tracking for MixFormerV2-S while surpassing several prior efficient trackers.

Abstract

Transformer-based trackers have achieved strong accuracy on the standard benchmarks. However, their efficiency remains an obstacle to practical deployment on both GPU and CPU platforms. To overcome this issue, we propose a fully transformer tracking framework, coined MixFormerV2, that uses neither dense convolutional operations nor a complex score prediction module. Our key design is to introduce four special prediction tokens and concatenate them with the tokens from the target template and the search area. We then apply a unified transformer backbone to this mixed token sequence. The prediction tokens capture the complex correlation between the target template and the search area via mixed attention. From them, we can predict the tracking box and estimate its confidence score through simple MLP heads. To further improve the efficiency of MixFormerV2, we present a new distillation-based model reduction paradigm comprising dense-to-sparse distillation and deep-to-shallow distillation. The former transfers knowledge from the dense-head-based MixViT to our fully transformer tracker, while the latter prunes some layers of the backbone. We instantiate two variants of MixFormerV2: MixFormerV2-B achieves an AUC of 70.6% on LaSOT and an AUC of 57.4% on TNL2k at a high GPU speed of 165 FPS, and MixFormerV2-S surpasses FEAR-L by 2.7% AUC on LaSOT at real-time CPU speed.
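
The token design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the random weights, the single plain self-attention layer standing in for the mixed-attention backbone, and the exact form of the two MLP heads are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over the whole mixed sequence.
    Assumption: a stand-in for the backbone's mixed attention."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU (hypothetical head layout)."""
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
dim = 16                      # hypothetical embedding size
n_template, n_search = 4, 9   # hypothetical token counts

template = rng.normal(size=(n_template, dim))
search = rng.normal(size=(n_search, dim))
pred_tokens = rng.normal(size=(4, dim))  # four prediction tokens (learnable in practice)

# Concatenate template, search and prediction tokens into one mixed sequence.
mixed = np.concatenate([template, search, pred_tokens], axis=0)

# One residual attention layer in place of the full backbone.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
mixed = mixed + attention(mixed, w_q, w_k, w_v)

# Only the four prediction tokens feed the heads.
pred_out = mixed[-4:]

# Box head (assumption): one scalar per prediction token, giving four box values.
w1_box = rng.normal(size=(dim, dim)) * 0.1
w2_box = rng.normal(size=(dim, 1)) * 0.1
box = mlp(pred_out, w1_box, w2_box).reshape(4)

# Score head (assumption): average the four tokens, then MLP and sigmoid.
w1_score = rng.normal(size=(dim, dim)) * 0.1
w2_score = rng.normal(size=(dim, 1)) * 0.1
logit = mlp(pred_out.mean(axis=0), w1_score, w2_score)
score = 1.0 / (1.0 + np.exp(-logit[0]))
```

The point of the sketch is structural: no dense convolutional head touches the search-region feature map, and both outputs are read from four token embeddings only.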

Paper Structure (42 sections, 13 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: Efficiency analysis of an 8-layer MixViT with different heads. 'Pyram. Corner' denotes the pyramidal corner head (mixformer_jour).
  • Figure 2: MixFormerV2 Framework. MixFormerV2 is a fully transformer tracking framework, composed of a transformer backbone and two simple MLP heads on the learnable prediction tokens.
  • Figure 3: Distillation-Based Model Reduction for MixFormerV2. 'Stage1' denotes the dense-to-sparse distillation and 'Stage2' the deep-to-shallow distillation. Blocks with orange arrows are supervised, and blocks with dotted outlines are eliminated.
  • Figure 4: Progressive Depth Pruning Process for eliminating blocks. All weights in the block decay to zero until only the residual connection remains, turning the block into an identity block.
  • Figure 5: Visualization of prediction-token-to-search attention maps, where the prediction tokens serve as the queries of the attention operation.
  • ...and 2 more figures
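
The progressive depth pruning idea in the Figure 4 caption can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's procedure: the branch function `block_fn`, the scaling factor `gamma` (which stands in for decaying the block's weights), and the linear `decay_schedule` are all hypothetical.

```python
def block_fn(x):
    """Stand-in for a transformer block's non-residual branch (hypothetical)."""
    return [2.0 * v + 1.0 for v in x]

def residual_block(x, gamma):
    """y = x + gamma * f(x).
    gamma = 1 gives the original block; gamma = 0 leaves only the
    residual connection, i.e. an identity block."""
    return [xi + gamma * fi for xi, fi in zip(x, block_fn(x))]

def decay_schedule(step, total_steps):
    """Linear decay from 1 to 0 (assumed schedule)."""
    return max(0.0, 1.0 - step / total_steps)

x = [0.5, -1.0, 2.0]
total_steps = 4

# Output of the block at each stage of the decay.
outputs = [
    residual_block(x, decay_schedule(step, total_steps))
    for step in range(total_steps + 1)
]
```

Once the factor reaches zero the block returns its input unchanged, so it can be dropped from the network without altering the output.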