Exploring Dynamic Transformer for Efficient Object Tracking

Jiawen Zhu, Xin Chen, Haiwen Diao, Shuai Li, Jun-Yan He, Chenyang Li, Bin Luo, Dong Wang, Huchuan Lu

TL;DR

This paper addresses the speed-precision trade-off in visual object tracking with DyTrack, a dynamic transformer framework that performs instance-specific computation through early exits. It attaches terminating branches to intermediate transformer layers, so easy frames take cheaper routes and difficult frames receive deeper reasoning, while a feature recycling mechanism reuses the outputs of earlier layers. A target-aware self-distillation strategy further trains the early exits to mimic the representation of the deep model, improving accuracy without increasing inference cost. With a single model, DyTrack reports favorable speed-precision trade-offs on major benchmarks (e.g., LaSOT, GOT-10k, TrackingNet), for instance 64.9% AUC on LaSOT at 256 fps. Its main contributions are a dynamic routing paradigm for tracking, a cascade feature recycling mechanism, and distillation-guided early predictions, all aimed at resource-constrained tracking applications.

Abstract

The speed-precision trade-off is a critical problem for visual object tracking, which usually requires low latency and deployment on constrained resources. Existing solutions for efficient tracking mainly focus on adopting light-weight backbones or modules, which nevertheless sacrifice precision. In this paper, inspired by dynamic network routing, we propose DyTrack, a dynamic transformer framework for efficient tracking. Real-world tracking scenarios exhibit diverse levels of complexity. We argue that a simple network is sufficient for easy frames in video sequences, while more computation could be assigned to difficult ones. DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget. Thus, it can achieve higher performance at the same running speed. We formulate instance-specific tracking as a sequential decision problem and attach terminating branches to intermediate layers of the entire model. In particular, to fully utilize the computations, we introduce a feature recycling mechanism that reuses the outputs of predecessors. Furthermore, a target-aware self-distillation strategy is designed to enhance the discriminating capabilities of early predictions by effectively mimicking the representation pattern of the deep model. Extensive experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model. For instance, DyTrack obtains 64.9% AUC on LaSOT with a speed of 256 fps.
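
The early-exit idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `early_exit_forward`, the toy stages and heads, the threshold value, and the element-wise addition used to recycle features are all assumptions made for illustration. The only ideas taken from the text are that terminating branches are attached to intermediate layers, that a predicted confidence score decides whether to stop, and that later exits reuse what earlier ones computed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Feature = List[float]
Box = Tuple[float, float, float, float]
Stage = Callable[[Feature], Feature]
Head = Callable[[Feature], Tuple[Box, float]]


def early_exit_forward(
    x: Feature,
    stages: Sequence[Stage],
    heads: Sequence[Head],
    threshold: float,
) -> Tuple[Box, float, int]:
    """Run stages in order and stop at the first exit whose score passes threshold.

    Returns (box, score, depth), where depth is the number of stages executed.
    If no exit is confident enough, the prediction of the last exit is returned.
    """
    if not stages or len(stages) != len(heads):
        raise ValueError("need at least one stage and exactly one head per stage")
    recycled: Optional[Feature] = None
    for depth, (stage, head) in enumerate(zip(stages, heads), start=1):
        x = stage(x)
        # Assumed recycling rule: add the feature that the previous exit used.
        fused = x if recycled is None else [a + b for a, b in zip(x, recycled)]
        box, score = head(fused)
        recycled = fused
        if score >= threshold:
            break  # confident enough, so the remaining stages are skipped
    return box, score, depth


def toy_stage(x: Feature) -> Feature:
    """Hypothetical stand-in for a block of transformer layers."""
    return [v + 1.0 for v in x]


def toy_head(f: Feature) -> Tuple[Box, float]:
    """Hypothetical stand-in for a box head plus a confidence score head."""
    score = min(1.0, sum(f) / len(f) / 10.0)
    return (0.0, 0.0, 1.0, 1.0), score


STAGES = [toy_stage] * 4
HEADS = [toy_head] * 4

# An "easy" input is confident after one stage; a "hard" one needs more depth.
_, _, easy_depth = early_exit_forward([5.0, 5.0], STAGES, HEADS, threshold=0.5)
_, _, hard_depth = early_exit_forward([0.0, 0.0], STAGES, HEADS, threshold=0.5)
print(easy_depth, hard_depth)
```

In this sketch the threshold acts as the knob for the speed-precision trade-off: raising it sends more frames through deeper routes, while lowering it lets more frames leave early.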

Paper Structure

This paper contains 14 sections, 6 equations, 13 figures, and 4 tables.

Figures (13)

  • Figure 1: Speed-precision trade-off comparison on the LaSOT benchmark. The top-right points are on the Pareto front. DyTrack performs better than competing trackers at similar running speeds (e.g., 32.5 points higher than ECO) and runs faster than others when attaining the same performance (e.g., 4.1 times faster than TransT).
  • Figure 2: Comparison between models with various depths.
  • Figure 3: Efficient object tracking via instance-specific reasoning.
  • Figure 4: An overview of DyTrack. This framework achieves instance-specific inference where the forward propagation will terminate early when the current feature representation is reliably conditioned on the input. We take the learned IoU score as the condition to determine whether the prediction is sufficiently confident, and the cascade feature recycling mechanism allows reusing computation of the predecessors. DyTrack can achieve promising speed-precision trade-offs.
  • Figure 5: Diagrams of four feature reuse schemas. The output of the score head is omitted for better readability.
  • ...and 8 more figures