Progressive Scaling Visual Object Tracking

Jack Hong, Shilin Yan, Zehao Xiao, Jiayin Cai, Xiaolong Jiang, Yao Hu, Henghui Ding

TL;DR

This work tackles the problem of efficiently scaling visual object tracking models by systematically studying model size, training data volume, and input resolution. It introduces DT-Training, a progressive scaling framework that combines small teacher transfer with dual-branch alignment, enabling smooth optimization and iterative growth across training stages via $L_{total}(f; \hat{f}) = L_{clean}(f) + \lambda_{transfer}L_{transfer}(f; \hat{f}) + \lambda_{align}L_{align}(f)$. The approach delivers substantial gains, including a $4.7\%$ improvement on LaSOT when upgrading from ViT-Base to ViT-Large at an input resolution of 384, and achieves a mean AUC of $64.8$ on GTrack Bench, exceeding prior methods by at least $1.4$ points, while maintaining inference speed. DT-Training also generalizes to multimodal data and transfers to downstream tasks such as object detection, underscoring its practical relevance for robust, scalable vision systems.
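As a rough illustration of how the three terms in $L_{total}$ combine, the following minimal sketch computes the weighted sum from scalar loss values. It is not the authors' implementation: the function name `total_loss`, the default weights, and the example numbers are all hypothetical, and how $L_{clean}$, $L_{transfer}$, and $L_{align}$ themselves are computed is not specified in this summary.

```python
def total_loss(l_clean, l_transfer, l_align, lambda_transfer=1.0, lambda_align=1.0):
    """Weighted sum mirroring L_total = L_clean + lambda_transfer * L_transfer
    + lambda_align * L_align.

    The default weights of 1.0 are placeholders (an assumption), not values
    reported in the paper.
    """
    return l_clean + lambda_transfer * l_transfer + lambda_align * l_align


# Hypothetical per-batch loss values, chosen only to make the arithmetic easy to follow.
example = total_loss(1.0, 0.5, 0.25, lambda_transfer=2.0, lambda_align=4.0)
print(example)  # 1.0 + 2.0 * 0.5 + 4.0 * 0.25 = 3.0
```

Setting both weights to zero reduces the objective to $L_{clean}$ alone, which in this sketch corresponds to training without the transfer and alignment terms.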

Abstract

In this work, we propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance. Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement. To address this issue, we introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential. The resulting scaled tracker consistently outperforms state-of-the-art methods across multiple benchmarks, demonstrating strong generalization and transferability of the proposed method. Furthermore, we validate the broader applicability of our approach to additional tasks, underscoring its versatility beyond tracking.

Paper Structure

This paper contains 24 sections, 6 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Pioneer Experiments. We analyze the impact of three key factors in visual object tracking: (a) model size, (b) training data volume, and (c) input resolution.
  • Figure 2: Overview of our progressive scaling approach, DT-Training. Our DT-Training includes small teacher transfer and dual-branch alignment. We provide an illustrative example of continuous iterative expansion to show a gradual increase in training data, model size, and image resolution. The order of expanding the three key factors is flexible and can be adjusted as needed.
  • Figure 3: Ablation study on the mask ratio and regularization parameters. We conduct experiments to explore the impact of the mask ratio $p$ and the regularization parameters $\lambda_{transfer}$ and $\lambda_{align}$.