Collecting Consistently High Quality Object Tracks with Minimal Human Involvement by Using Self-Supervised Learning to Detect Tracker Errors

Samreen Anjum, Suyog Jain, Danna Gurari

TL;DR

This work tackles the challenge of producing consistently high-quality object tracks with minimal human input. It introduces SSLTrack, a tracker-agnostic hybrid framework that leverages offline self-supervised learning to tailor object representations from unlabeled videos and online similarity monitoring to decide when manual re-localization is needed. When the online similarity between a tracker’s prediction and the reference template drops below a threshold, a neighborhood-based frame selection strategy prompts a single human annotation, reducing unnecessary interactions. The approach is validated on GMOT-40, ImageNet VID, and MOT15, showing improved recall and MOTA with less annotation effort, particularly for small, fast-moving, or occluded objects, and demonstrating robust applicability across different trackers and tracking scenarios.

Abstract

We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with minimal human input. The key idea is to tailor a module for each dataset that decides when an object tracker is failing, so that a human can be brought in to re-localize the object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object, which is then used to actively monitor the tracked region and detect tracker failure. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate that our method outperforms existing approaches, especially for small, fast-moving, or occluded objects.
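
The monitoring idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the 0.8 threshold, and the `annotate` callback are all assumptions made for illustration. The sketch also simplifies two things: the embeddings of the tracker's predictions are assumed to be precomputed (a real system would re-initialize the tracker after each human annotation), and the flagged frame itself is sent for annotation rather than a frame chosen by a neighborhood search.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def frames_needing_annotation(
    reference: Vector,
    tracked: List[Vector],
    annotate: Callable[[int], Vector],
    threshold: float = 0.8,  # assumed value, not taken from the paper
) -> List[int]:
    """Return the frame indices at which a human would be asked to re-localize.

    reference: embedding of the initial human-drawn template.
    tracked:   embeddings of the tracker's predicted crop, one per frame
               (assumed precomputed for this sketch).
    annotate:  hypothetical callback returning the embedding of the
               human-drawn box for a given frame index.
    """
    flagged: List[int] = []
    for frame_idx, embedding in enumerate(tracked):
        if cosine_similarity(reference, embedding) < threshold:
            flagged.append(frame_idx)
            # The new human annotation becomes the reference template.
            reference = annotate(frame_idx)
        # Otherwise the automated tracker simply continues to the next frame.
    return flagged
```

For example, with a reference of `[1.0, 0.0]` and tracked embeddings `[[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]`, an `annotate` callback that returns `[0.0, 1.0]` leads to a single human request at frame 2, because the refreshed reference matches the final frame again.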

Paper Structure

This paper contains 24 sections, 1 equation, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Overview of our hybrid object tracking framework. The red bounding box is drawn by a human and the green box is the automated tracker's prediction. For each frame, the online frame selection module uses the self-supervised object representation model (trained offline) to compute features for the last reference template and for the automated tracker's template. If the feature similarity is below a threshold, a frame is selected for manual annotation using our neighborhood search module. Otherwise, the automated tracker continues on successive frames.
  • Figure 2: Performance comparison of our method against other state-of-the-art methods on the MOT15 dataset. With either automated tracker, OSTrack or Stark, our method outperforms all other methods and achieves higher accuracy at a lower annotation rate.
  • Figure 3: Tracking performance with respect to object attributes for all the objects in three datasets. Relative object size is the ratio of the object size to the input image size, displacement is the speed of the object, occlusion ratio is the level of occlusion, and changes in orientation is the number of times an object changes its orientation across the video. Our method outperforms the uniform baseline approach particularly in challenging scenarios where objects are smaller, move fast, encounter occlusions, or change their orientation more often along their trajectories.
  • Figure 4: Performance comparison of our method against the uniform selection across different categories in both datasets. Our method outperforms the baseline on 8 out of 10 categories for GMOT-40 and 23 out of 30 categories for ImageNet VID.
  • Figure 5: t-SNE visualization of object embeddings from four different representation models. Each data point in each map represents the ground truth crop of a unique object in the GMOT-40 dataset (total = 1,944). Embeddings are extracted using (a) an off-the-shelf VGG-16 model pre-trained on ImageNet, (b) a SimCLR model trained on object proposals extracted from GMOT-40 (our model), (c) a SimCLR model trained on whole frames from GMOT-40, and (d) a SimCLR model trained on all ground truth instance crops in the GMOT-40 dataset. Embeddings extracted using our model are better separated than those from the VGG model and the SimCLR model trained on whole frames, and are similar to those from the model trained on GT crops (e.g., balloon, bird). Best viewed in color.
  • ...and 1 more figure