Beyond SOT: Tracking Multiple Generic Objects at Once

Christoph Mayer, Martin Danelljan, Ming-Hsuan Yang, Vittorio Ferrari, Luc Van Gool, Alina Kuznetsova

TL;DR

This work introduces LaGOT, a large-scale benchmark for multi-object Generic Object Tracking (GOT) built on LaSOT to evaluate joint tracking of multiple user-defined targets. It proposes TaMOs, a Transformer-based tracker that processes full-frame inputs with a generic multi-object encoding pool to predict and localize all targets simultaneously, achieving ≈4× faster run-time for 10 targets compared to independent per-object tracking and outperforming single-object trackers on LaGOT. TaMOs also demonstrates strong generalization by achieving state-of-the-art results on single-object GOT benchmarks such as TrackingNet (AUC 84.4%) while maintaining competitive performance on others. The dataset and method jointly advance robust, efficient tracking in open-world scenes, enabling practical deployment in surveillance, video annotation, and robotics.

Abstract

Generic Object Tracking (GOT) is the problem of tracking target objects, specified by bounding boxes in the first frame of a video. While the task has received much attention in the last decades, researchers have almost exclusively focused on the single-object setting. Multi-object GOT benefits from a wider applicability, rendering it more attractive in real-world applications. We attribute the lack of research interest in this problem to the absence of suitable benchmarks. In this work, we introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence. Our benchmark allows users to tackle key remaining challenges in GOT, aiming to increase robustness and reduce computation through joint tracking of multiple objects simultaneously. In addition, we propose a transformer-based GOT tracker baseline capable of joint processing of multiple objects through shared computation. Our approach achieves a 4x faster run-time for 10 concurrent objects compared to tracking each object independently and outperforms existing single-object trackers on our new benchmark. In addition, our approach achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%. Our benchmark, code, and trained models will be made publicly available.

Paper Structure (31 sections, 6 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Multiple Object Tracking (MOT) methods track all objects belonging to classes in a predefined category list, while all other objects are ignored. Single Object Tracking (SOT) methods focus on tracking only a single user-specified object per video. Thus, when faced with multiple objects, such methods must resort to tracking each object independently, so computation grows linearly with the number of objects. Our tracker can jointly track multiple generic objects that are defined via user-specified bounding boxes, which opens the opportunity to save computation and to exploit inter-object information for improved robustness. The box colors correspond to track IDs.
  • Figure 2: Examples of the annotated objects in the video sequences of our LaGOT dataset. The objects are annotated at $10$ FPS. Notice the diversity of the annotated media as well as the complexity of the scenes.
  • Figure 3: Overview of our tracker TaMOs for joint tracking of multiple targets. First, we extract features from the training and test frames. All objects in the training frame are encoded jointly with a multi-object encoding and passed to the model predictor together with the training frame features. The model predictor produces target models $\hat{\theta}_i$ together with enhanced test features. We apply a feature pyramid network (FPN) to the enhanced output features to generate higher-resolution test features. Finally, we predict the bounding box of each target by applying its target model $\hat{\theta}_i$.
  • Figure 4: Success plot, showing $\text{OP}_T$, on LaGOT (AUC is reported in the legend). Tracking Precision-Recall curve on LaGOT -- VOTLT is reported in the legend (the highest F1-score).
  • Figure 5: Success plot, showing $\text{OP}_T$, on LaGOT (AUC is reported in the legend). Tracking Precision-Recall curve on LaGOT -- VOTLT is reported in the legend (the highest F1-score).
  • ...and 6 more figures
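
The success plots in Figures 4 and 5 report the overlap precision $\text{OP}_T$ and its AUC. As a rough illustration of how such a curve is commonly computed in tracking benchmarks, the sketch below takes per-frame IoU values for one track, evaluates the fraction of frames whose IoU exceeds each threshold $T$, and averages over the thresholds. The function names, the `(x, y, w, h)` box format, and the 21 evenly spaced thresholds are assumptions made for this example, not details taken from the paper. How scores are aggregated across the multiple tracks of a LaGOT sequence is not stated here and is left out.

```python
import numpy as np


def iou_xywh(a, b):
    """IoU of two axis-aligned boxes given as (x, y, w, h). Box format is an assumption."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_curve(ious, thresholds=None):
    """Return (thresholds, OP_T): fraction of frames with IoU strictly above each T."""
    if thresholds is None:
        # Assumed threshold grid: 0.00, 0.05, ..., 1.00
        thresholds = np.linspace(0.0, 1.0, 21)
    ious = np.asarray(ious, dtype=float)
    op = np.array([(ious > t).mean() for t in thresholds])
    return thresholds, op


def success_auc(ious, thresholds=None):
    """Area under the success curve, approximated as the mean of OP_T over thresholds."""
    _, op = success_curve(ious, thresholds)
    return float(op.mean())


if __name__ == "__main__":
    # Hypothetical predictions and ground truth for a three-frame track.
    pred = [(10, 10, 20, 20), (12, 10, 20, 20), (50, 50, 20, 20)]
    gt = [(10, 10, 20, 20), (10, 10, 20, 20), (10, 10, 20, 20)]
    ious = [iou_xywh(p, g) for p, g in zip(pred, gt)]
    print("per-frame IoU:", ious)
    print("success AUC:", success_auc(ious))
```

A frame where the tracker misses the target entirely contributes an IoU of 0 and therefore lowers $\text{OP}_T$ at every threshold, which is why the AUC rewards both accurate boxes and consistent tracking.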