Predicting the Best of N Visual Trackers

Basit Alawode, Sajid Javed, Arif Mahmood, Jiri Matas

TL;DR

This work addresses the variability of state-of-the-art (SOTA) visual trackers across video attributes by introducing the Best of the $N$ Trackers (BofN) meta-tracker, which selects a single tracker from a pool using a Tracking Performance Prediction Network (TP$^{2}$N) that operates on a few initial frames. TP$^{2}$N builds on self-supervised learning backbones (including ViT-S with DINO) and two training regimes to predict the likely best tracker without running all candidates, enabling video-level and frame-level inference with little overhead. Ground-truth labels are generated by evaluating all $N$ trackers on diverse training videos and encoding the outcome as supervision for TP$^{2}$N, which is fine-tuned on a compound dataset combining LaSOT, GOT-10K, and TrackingNet. Empirically, BofN improves substantially over 17 SOTA trackers across the LaSOT, TrackingNet, GOT-10K, and VOT benchmarks; frame-level prediction gives the largest gains while its overhead remains manageable, which makes the approach practical for real-time or near-real-time tracking. The results indicate that selecting one tracker by prediction can outperform the individual trackers in the pool while costing far less than running them all concurrently, and that the framework could extend to other computer vision tasks and data-efficient scenarios.
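The label-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `best_tracker_labels`, the tracker names, and the per-video scores are hypothetical, and the label encoding (index of the top-scoring tracker, with ties going to the earliest tracker in the list) is one plausible choice that the summary does not specify.

```python
from typing import Dict, List


def best_tracker_labels(
    scores: Dict[str, List[float]], trackers: List[str]
) -> Dict[str, int]:
    """Return, for each video, the index of the tracker with the highest score.

    `scores[video][i]` is an assumed per-video quality score (for example,
    average overlap) of `trackers[i]`. Ties resolve to the lowest index.
    """
    labels: Dict[str, int] = {}
    for video, per_tracker in scores.items():
        if len(per_tracker) != len(trackers):
            raise ValueError(f"{video}: expected {len(trackers)} scores")
        # max() returns the first maximal element, so ties pick the lowest index.
        labels[video] = max(range(len(trackers)), key=lambda i: per_tracker[i])
    return labels


# Hypothetical pool and made-up scores, for illustration only.
trackers = ["tracker_a", "tracker_b", "tracker_c"]
scores = {
    "video_1": [0.61, 0.72, 0.55],
    "video_2": [0.80, 0.64, 0.80],  # tie between index 0 and index 2
}
labels = best_tracker_labels(scores, trackers)
print({video: trackers[i] for video, i in labels.items()})
```

Labels of this kind could then serve as classification targets for a prediction network; the actual encoding used to supervise TP$^{2}$N may differ.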

Abstract

We observe that the performance of SOTA visual trackers varies surprisingly strongly across different video attributes and datasets. No single tracker remains the best performer across all tracking attributes and datasets. To bridge this gap, for a given video sequence, we predict the "Best of the N Trackers", called the BofN meta-tracker. At its core, a Tracking Performance Prediction Network (TP2N) selects the predicted best-performing visual tracker for the given video sequence using only a few initial frames. We also introduce a frame-level BofN meta-tracker, which re-predicts the best performer at regular temporal intervals. The TP2N is based on the self-supervised learning architectures MoCo v2, SwAV, BT, and DINO; experiments show that DINO with a ViT-S backbone performs best. The video-level BofN meta-tracker outperforms, by a large margin, existing SOTA trackers on nine standard benchmarks: LaSOT, TrackingNet, GOT-10K, VOT2019, VOT2021, VOT2022, UAV123, OTB100, and WebUAV-3M. Further improvement is achieved by the frame-level BofN meta-tracker, which effectively handles variations in the tracking scenario within long sequences. For instance, on GOT-10K, the BofN meta-tracker achieves an average overlap (AO) of 88.7% and 91.1% in the video-level and frame-level settings, respectively, whereas the best-performing individual tracker, RTS, achieves 85.20% AO. On VOT2022, the BofN expected average overlap is 67.88% and 70.98% in the video-level and frame-level settings, compared to 64.12% for the best-performing tracker, ARTrack. This work also presents an extensive evaluation of competitive tracking methods on all commonly used benchmarks, following their protocols. The code, the trained models, and the results will soon be made publicly available at https://github.com/BasitAlawode/Best_of_N_Trackers.
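The frame-level idea, re-predicting the best performer at regular temporal intervals and running only the selected tracker in between, can be sketched as below. This is a hedged sketch and not the released code: `run_frame_level`, `predict_best`, `interval`, and `window` are hypothetical names, trackers are modeled as plain callables that map a frame and the previous box to a new box, and feeding the predictor the first `window` frames of each interval is an assumption that mirrors the "few initial frames" used at the video level.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height)
Tracker = Callable[[Any, Box], Box]      # (frame, previous box) -> new box


def run_frame_level(
    frames: Sequence[Any],
    init_box: Box,
    trackers: Dict[str, Tracker],
    predict_best: Callable[[Sequence[Any]], str],
    interval: int,
    window: int,
) -> List[Tuple[str, Box]]:
    """Track a sequence, re-selecting the active tracker every `interval` frames."""
    if interval < 1 or window < 1:
        raise ValueError("interval and window must be positive")
    results: List[Tuple[str, Box]] = []
    box = init_box
    active = ""
    for t, frame in enumerate(frames):
        if t % interval == 0:
            # Assumed: the predictor sees a short window at the interval start.
            active = predict_best(frames[t : t + window])
        # Only the selected tracker runs; it continues from the last box.
        box = trackers[active](frame, box)
        results.append((active, box))
    return results


# Toy stand-ins: frames are integers, trackers just shift the box along x.
toy_trackers: Dict[str, Tracker] = {
    "slow": lambda frame, b: (b[0] + 1.0, b[1], b[2], b[3]),
    "fast": lambda frame, b: (b[0] + 2.0, b[1], b[2], b[3]),
}
toy_predictor = lambda win: "slow" if win[0] < 4 else "fast"
out = run_frame_level(list(range(8)), (0.0, 0.0, 1.0, 1.0),
                      toy_trackers, toy_predictor, interval=4, window=2)
print([name for name, _ in out])
```

Setting `interval` to at least the sequence length reduces this loop to the video-level setting, where a single prediction is made from the first few frames.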

Paper Structure (20 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: SOTA visual trackers -- performance variation. Dependence on video attributes on (a) LaSOT and (b) UAV123. The proposed BofN dominates for all attributes on both datasets. The number of videos where a tracker is the top performer on the (c) LaSOT, (d) UAV123, and (e) VOT2022 datasets.
  • Figure 2: Structure of the proposed BofN -- "Best of the $N$ Trackers".
  • Figure 3: Visual results of the proposed BofN -- "Best of the $N$ Trackers" and its comparison with existing SOTA trackers, including ToMP [mayer2022transforming], RTS [Paul2022], AiATrack [gao2022aiatrack], OSTrack [ye2022ostrack], GRM [gao2023generalized], ARTrack [Wei_2023_CVPR], DropTrack [dropmae2023], and CiteTracker [li2023citetracker], on 12 challenging sequences selected from the LaSOT [8954084] and VOT2022 [Kristan_2022_ICCV] datasets. Frame indices and sequence names are shown for each sequence. The proposed BofN tracker performs consistently well on these challenging sequences compared with the other trackers.