Predicting the Best of N Visual Trackers
Basit Alawode, Sajid Javed, Arif Mahmood, Jiri Matas
TL;DR
This work tackles the variability of state-of-the-art visual trackers across video attributes by introducing the Best of the $N$ Trackers (BofN) meta-tracker, which selects the single best tracker from a pool using a Tracking Performance Prediction Network (TP$^{2}$N) that operates on a few initial frames. TP$^{2}$N builds on self-supervised learning backbones (including ViT-S with DINO) and two training regimes to predict the likely best tracker without running all candidates, enabling video-level and frame-level inference with minimal overhead. Ground-truth labels are generated by evaluating all $N$ trackers on diverse training videos and encoding the outcome as supervision for TP$^{2}$N, which is fine-tuned on a compound dataset combining LaSOT, GOT-10K, and TrackingNet. Empirically, BofN achieves substantial improvements over 17 SOTA trackers across the LaSOT, TrackingNet, GOT-10K, and VOT benchmarks. Frame-level predictions provide the largest gains while their overhead remains manageable, making the approach practical for real-time or near-real-time tracking scenarios. The results demonstrate that selecting one tracker by prediction can outperform running multiple trackers concurrently while significantly reducing computational cost, and that the framework is extensible to other CV tasks and data-efficient scenarios.
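To make the label-generation step concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each tracker is scored on a training video by its mean intersection-over-union against the annotations, and that the supervision is a one-hot vector marking the top-scoring tracker. The summary above does not specify the exact encoding, so both choices are assumptions, as are the helper names `iou`, `average_overlap`, and `best_tracker_label`.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred: Sequence[Box], gt: Sequence[Box]) -> float:
    """Mean per-frame IoU between a tracker's output and the annotations."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return sum(iou(p, g) for p, g in zip(pred, gt)) / len(gt)


def best_tracker_label(
    predictions: Dict[str, Sequence[Box]], gt: Sequence[Box]
) -> Tuple[str, List[int]]:
    """Return the best tracker's name and a one-hot label over sorted names.

    Assumption: the label is the argmax of average overlap; the actual
    encoding used to supervise TP^2N may differ.
    """
    names = sorted(predictions)
    scores = [average_overlap(predictions[n], gt) for n in names]
    best = max(range(len(names)), key=scores.__getitem__)
    one_hot = [int(i == best) for i in range(len(names))]
    return names[best], one_hot
```

Running every tracker is needed only to build these training labels. At inference time the prediction network replaces this exhaustive evaluation.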
Abstract
We observe that the performance of SOTA visual trackers varies surprisingly strongly across video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, we predict, for a given video sequence, the "Best of the N Trackers", which we call the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects the tracker predicted to perform best on the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker, which re-predicts the best performer at regular temporal intervals. The TP2N is based on the self-supervised learning architectures MoCo v2, SwAV, Barlow Twins (BT), and DINO; experiments show that DINO with a ViT-S backbone performs best. The video-level BofN meta-tracker outperforms existing SOTA trackers by a large margin on nine standard benchmarks: LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. The frame-level BofN meta-tracker improves results further by handling variations in the tracking scenario within long sequences. For instance, on GOT-10K, the BofN meta-tracker achieves an average overlap (AO) of 88.7% and 91.1% in the video-level and frame-level settings, respectively, whereas the best performing single tracker, RTS, achieves 85.20% AO. On VOT2022, the BofN expected average overlap is 67.88% and 70.98% in the video-level and frame-level settings, compared to 64.12% for the best performing tracker, ARTrack. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available at https://github.com/BasitAlawode/Best_of_N_Trackers.
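The difference between the video-level and frame-level settings can be illustrated with a short sketch. It is written under stated assumptions and is not the released code: `predict_best` is a hypothetical stand-in for TP2N that maps a short window of frames to a tracker index, `num_init` is the assumed number of frames it sees, and re-predicting from the window that starts at each interval boundary is one plausible reading of "regular temporal intervals".

```python
from typing import Callable, List, Optional, Sequence


def bofn_schedule(
    frames: Sequence,
    predict_best: Callable[[Sequence], int],
    num_init: int = 3,
    interval: Optional[int] = None,
) -> List[int]:
    """Return, for every frame, the index of the tracker selected for it.

    interval=None  -> video-level: predict once from the first frames.
    interval=k     -> frame-level: re-predict every k frames (assumed policy).
    """
    if num_init < 1:
        raise ValueError("num_init must be at least 1")
    if interval is not None and interval < 1:
        raise ValueError("interval must be at least 1")

    schedule: List[int] = []
    current = -1  # overwritten at t == 0 before it is ever used
    for t in range(len(frames)):
        starts_segment = (t == 0) if interval is None else (t % interval == 0)
        if starts_segment:
            # Only a short window is passed to the predictor, so no
            # candidate tracker has to be run in order to make the choice.
            current = predict_best(frames[t : t + num_init])
        schedule.append(current)
    return schedule
```

With `interval=None` the predictor is called once per video. With a finite interval it is called once per segment, which is where the extra frame-level overhead comes from. How the newly selected tracker is initialized when the choice changes mid-sequence is not covered by this sketch.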
