Progressive Scaling Visual Object Tracking
Jack Hong, Shilin Yan, Zehao Xiao, Jiayin Cai, Xiaolong Jiang, Yao Hu, Henghui Ding
TL;DR
This work tackles the problem of efficiently scaling visual object tracking models by systematically studying model size, training data volume, and input resolution. It introduces DT-Training, a progressive scaling framework that combines small teacher transfer with dual-branch alignment, enabling smooth optimization and iterative growth across training stages. The training objective is $L_{total}(f; \hat{f}) = L_{clean}(f) + \lambda_{transfer}L_{transfer}(f; \hat{f}) + \lambda_{align}L_{align}(f)$. The approach delivers substantial gains, including a $4.7\%$ improvement on LaSOT when upgrading from ViT-Base to ViT-Large at an input resolution of 384, and achieves a mean AUC of $64.8$ on GTrack Bench, exceeding prior methods by at least $1.4$ points, while maintaining inference speed. DT-Training also generalizes to multimodal data and transfers to downstream tasks such as object detection, underscoring its practical relevance for robust, scalable vision systems.
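To make the structure of the objective concrete, the following is a minimal sketch of how the three terms of $L_{total}$ might be combined once each has been computed. It assumes the per-term losses are already available as scalars. The names `LossWeights` and `total_loss` and the default weight values are hypothetical and do not come from the paper. How $L_{clean}$, $L_{transfer}$, and $L_{align}$ are themselves computed is not specified in this summary and is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights for the auxiliary terms (hypothetical defaults, not from the paper)."""
    transfer: float = 1.0  # plays the role of lambda_transfer
    align: float = 1.0     # plays the role of lambda_align


def total_loss(clean: float, transfer: float, align: float,
               weights: LossWeights = LossWeights()) -> float:
    """Weighted sum: L_clean + lambda_transfer * L_transfer + lambda_align * L_align."""
    return clean + weights.transfer * transfer + weights.align * align


if __name__ == "__main__":
    # Illustrative values only; they are not results from the paper.
    w = LossWeights(transfer=0.5, align=0.25)
    print(total_loss(clean=1.0, transfer=2.0, align=4.0, weights=w))
```

With both weights set to zero, the expression reduces to the clean loss alone, so the two auxiliary terms act as additive regularizers whose influence is controlled by $\lambda_{transfer}$ and $\lambda_{align}$.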
Abstract
In this work, we propose a progressive scaling training strategy for visual object tracking, systematically analyzing the influence of training data volume, model size, and input resolution on tracking performance. Our empirical study reveals that while scaling each factor leads to significant improvements in tracking accuracy, naive training suffers from suboptimal optimization and limited iterative refinement. To address this issue, we introduce DT-Training, a progressive scaling framework that integrates small teacher transfer and dual-branch alignment to maximize model potential. The resulting scaled tracker consistently outperforms state-of-the-art methods across multiple benchmarks, demonstrating strong generalization and transferability of the proposed method. Furthermore, we validate the broader applicability of our approach to additional tasks, underscoring its versatility beyond tracking.
