Collecting Consistently High Quality Object Tracks with Minimal Human Involvement by Using Self-Supervised Learning to Detect Tracker Errors
Samreen Anjum, Suyog Jain, Danna Gurari
TL;DR
This work tackles the challenge of producing consistently high-quality object tracks with minimal human input. It introduces SSLTrack, a tracker-agnostic hybrid framework that leverages offline self-supervised learning to tailor object representations from unlabeled videos and online similarity monitoring to decide when manual re-localization is needed. When the online similarity between a tracker’s prediction and the reference template drops below a threshold, a neighborhood-based frame selection strategy prompts a single human annotation, reducing unnecessary interactions. The approach is validated on GMOT-40, ImageNet VID, and MOT15, showing improved recall and MOTA with less annotation effort, particularly for small, fast-moving, or occluded objects, and demonstrating robust applicability across different trackers and tracking scenarios.
Abstract
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with minimal human input. The key idea is to tailor a module for each dataset that decides when an object tracker is failing, so that a human can be brought in to re-localize the object and tracking can continue. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object, which is then used to actively monitor the tracked region and detect tracker failure. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate that our method outperforms existing approaches, especially for small, fast-moving, or occluded objects.
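The monitoring step described above can be illustrated with a minimal sketch. It assumes that each tracked region has already been mapped to an embedding vector by some learned encoder, and that similarity is measured with cosine similarity. The function names, the threshold value, and the `min_gap` rule for collapsing nearby failures into a single request are all hypothetical choices made for illustration. In particular, `min_gap` is only a simple stand-in and is not the neighborhood-based frame selection strategy of the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def frames_needing_annotation(template, tracked_embeddings, threshold=0.5, min_gap=3):
    """Return frame indices at which a human would be asked to re-localize the object.

    template: embedding of the reference object (assumed given).
    tracked_embeddings: one embedding per frame for the tracker's predicted region.
    threshold: hypothetical similarity cutoff below which the tracker is deemed failing.
    min_gap: hypothetical rule; a new request is issued only if at least this many
        frames have passed since the previous request.
    """
    requests = []
    last_request = None
    for idx, embedding in enumerate(tracked_embeddings):
        if cosine_similarity(template, embedding) >= threshold:
            continue  # prediction still resembles the template; no human needed
        if last_request is None or idx - last_request >= min_gap:
            requests.append(idx)
            last_request = idx
    return requests


# Toy example with 2-D embeddings: frames 2-3 and 5-6 drift away from the template.
template = [1.0, 0.0]
tracked = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 1.0],
           [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
flagged = frames_needing_annotation(template, tracked, threshold=0.5, min_gap=3)
print(flagged)
```

In a full system, the tracker would presumably be re-initialized from the human's annotation at each flagged frame before tracking resumes. This sketch leaves that step out and only shows the decision of when to ask.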
