Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

Xiangyang Yang; Dan Zeng; Xucheng Wang; You Wu; Hengzhou Ye; Qijun Zhao; Shuiwang Li

TL;DR

This work tackles the efficiency bottleneck of transformer-based visual trackers on constrained hardware by presenting ABTrack, a framework that adaptively bypasses transformer blocks and prunes token dimensions. It introduces a Bypass Decision Module that computes a bypass probability $p_i = \sigma(l^i(b^i))$ and compares it against a threshold $\rho$ to selectively skip blocks, while enforcing non-bypass of the initial $n_{enf}$ layers to preserve low-level features. A vision transformer pruning method using dimension reduction matrices $\mathbf{D}_1$ and $\mathbf{D}_2$ with $L_1$ regularization enables end-to-end training and subsequent binarization for runtime efficiency, with a local ranking strategy outperforming global ranking. The system is evaluated across multiple benchmarks, delivering state-of-the-art real-time performance with minimal loss of accuracy, and demonstrates broad applicability across different ViT backbones and trackers. Overall, ABTrack provides a practical, generalizable approach to adaptive computation in vision transformers, facilitating efficient deployment of high-accuracy tracking on edge devices.
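The bypass rule summarized above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: it assumes that $l^i$ is a single linear layer, that $b^i$ is a plain feature vector, and that a block is skipped when $p_i > \rho$ (the direction of the comparison is an assumption here). The names `bypass_probability` and `run_backbone` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bypass_probability(b, weights, bias):
    """p_i = sigmoid(l^i(b^i)), with l^i assumed to be one linear layer."""
    logit = sum(w * x for w, x in zip(weights, b)) + bias
    return sigmoid(logit)

def run_backbone(b, blocks, gates, rho=0.5, n_enf=1):
    """Apply blocks in order, skipping block i when its gate fires.

    blocks: list of callables mapping a feature vector to a feature vector.
    gates:  list of (weights, bias) pairs, one hypothetical gate per block.
    The first n_enf blocks are always executed (never bypassed).
    Assumption: a block is bypassed when p_i > rho.
    """
    executed = []
    for i, (block, (w, bias)) in enumerate(zip(blocks, gates)):
        if i >= n_enf and bypass_probability(b, w, bias) > rho:
            continue  # bypass: features pass through unchanged
        b = block(b)
        executed.append(i)
    return b, executed

# Toy usage: three identical "blocks" that add 1 to every feature.
add_one = lambda v: [x + 1.0 for x in v]
toy_gates = [([10.0, 10.0], 0.0),    # would fire, but block 0 is enforced
             ([10.0, 10.0], 0.0),    # fires: block 1 is bypassed
             ([-10.0, -10.0], 0.0)]  # does not fire: block 2 runs
out, ran = run_backbone([1.0, 1.0], [add_one] * 3, toy_gates)
```

In this toy run the enforced first block and the last block execute while the middle block is skipped, which is the behavior the threshold $\rho$ and $n_{enf}$ are meant to control.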

Abstract

Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up the inference process. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: https://github.com/xyyang317/ABTrack.
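The difference between ranking pruned dimensions per block (local) and across all blocks at once (global) can be made concrete with a small sketch. It assumes, for illustration only, that each transformer block carries one learned importance score per latent dimension (for example the magnitudes of a diagonal, $L_1$-regularized reduction matrix) and that binarization keeps a fixed fraction of the dimensions. The function names and the `keep_ratio` parameter are hypothetical and are not taken from the released code.

```python
def l1_penalty(scores):
    """L1 regularizer over all per-dimension scores (assumed soft masks)."""
    return sum(abs(s) for block in scores for s in block)

def local_ranking_masks(scores, keep_ratio):
    """Binarize per block: keep the top fraction of dimensions in each block."""
    masks = []
    for block in scores:
        k = max(1, round(keep_ratio * len(block)))
        order = sorted(range(len(block)), key=lambda j: abs(block[j]), reverse=True)
        keep = set(order[:k])
        masks.append([1 if j in keep else 0 for j in range(len(block))])
    return masks

def global_ranking_masks(scores, keep_ratio):
    """Binarize jointly: keep the top fraction of dimensions over all blocks."""
    flat = [(abs(s), i, j) for i, block in enumerate(scores) for j, s in enumerate(block)]
    k = max(1, round(keep_ratio * len(flat)))
    flat.sort(key=lambda t: t[0], reverse=True)
    keep = {(i, j) for _, i, j in flat[:k]}
    return [[1 if (i, j) in keep else 0 for j in range(len(block))]
            for i, block in enumerate(scores)]

# Toy scores for two blocks with four latent dimensions each.
toy_scores = [[0.9, 0.8, 0.7, 0.6],
              [0.4, 0.3, 0.2, 0.1]]
local_masks = local_ranking_masks(toy_scores, keep_ratio=0.5)
global_masks = global_ranking_masks(toy_scores, keep_ratio=0.5)
```

With these toy scores, local ranking keeps two dimensions in every block, whereas global ranking keeps all of the first block and none of the second. This kind of imbalance is one plausible reason a per-block ranking can behave better, though the sketch does not reproduce the paper's experiments.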

Paper Structure (18 sections, 8 equations, 5 figures, 10 tables)

Introduction
Related Work
Visual Tracking
Efficient Tracking Methods
Efficient Vision Transformers
Proposed Approach
Overview
Bypass Decision Module (BDM)
Vision Transformer Pruning (VTP)
Prediction Head and Training Objective
Experiments
Implementation Details
Datasets
State-of-the-art Comparisons
Ablation Study
...and 3 more sections

Figures (5)

Figure 1: Tracking results of three trackers adapted from DropMAE (Wu et al., 2023) with different numbers of ViT layers, dubbed DropMAE-$N$, where $N = 1, 6, 11$ is the number of layers. Although only DropMAE-11 successfully tracks all three targets, DropMAE-1 succeeds in tracking the boat with a single layer, and DropMAE-6 succeeds in tracking the boat and the person with six layers. These results suggest that deep semantic features or relations are not always necessary for the visual tracking task.
Figure 2: (Left) Overview of the proposed ABTrack framework. It contains a single-stream backbone and a prediction head, in which the backbone consists of pruned ViT blocks and Bypass Decision Modules. (Right) The implementation details of the proposed Bypass Decision Module.
Figure 3: Comparison between global ranking pruning and local ranking pruning, where the grey blocks represent the dimensions pruned in each transformer block.
Figure 4: Illustration of the number of remaining ViT blocks, the IoU, and the predicted bounding boxes of ABTrack-DeiT and ABTrack-DeiT($\tau=\tau_0$) across three samples from LaSOT.
Figure 5: Qualitative analysis on five video sequences. These sequences were sourced from LaSOT and UAV123.
