Track Anything Annotate: Video annotation and dataset generation of computer vision models
Nikita Ivanov, Mark Klimov, Dmitry Glukhikh, Tatiana Chernysheva, Igor Glukhikh
TL;DR
This paper addresses the bottleneck of creating large labeled datasets for video-based computer vision by prototyping Track Anything Annotate, a tool that couples state-of-the-art segmentation and tracking models to automate annotation. The authors compare three pipelines (OpenCV+FastSAM, FastSAM+XMem, and SAM2+XMem++) and show that combining SAM2 with XMem++ yields higher accuracy in challenging scenarios at the cost of higher latency and memory use. The prototype exports datasets in YOLO format and includes a web-based demo that illustrates practical use. The approach offers a scalable route to rapid dataset generation for long videos and complex scenes, albeit with substantial hardware requirements.
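To make the YOLO export step concrete, here is a minimal sketch of how one tracked object's binary mask could be turned into a label line. It assumes the common YOLO detection layout (`class x_center y_center width height`, all normalized to the range 0 to 1); the summary does not say whether the tool writes detection or segmentation labels, and the function name `mask_to_yolo_line` is hypothetical rather than taken from the project.

```python
from typing import Optional, Sequence


def mask_to_yolo_line(mask: Sequence[Sequence[int]], class_id: int) -> Optional[str]:
    """Convert a binary mask (rows of 0/1 values) into one YOLO detection label line.

    Assumption: labels use the normalized `class x_center y_center width height`
    layout. Returns None when the mask is empty, e.g. the object left the frame.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0

    # Collect the column and row indices covered by the object.
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None

    # Tight bounding box; the max edge is exclusive so a 1-pixel object has size 1.
    x_min, x_max = min(xs), max(xs) + 1
    y_min, y_max = min(ys), max(ys) + 1

    # Normalize the box centre and size by the frame dimensions.
    x_center = (x_min + x_max) / 2 / width
    y_center = (y_min + y_max) / 2 / height
    box_w = (x_max - x_min) / width
    box_h = (y_max - y_min) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"


# Example: a 10x20 frame with an object covering rows 2..5 and columns 4..11.
frame_mask = [[1 if 2 <= y < 6 and 4 <= x < 12 else 0 for x in range(20)] for y in range(10)]
label = mask_to_yolo_line(frame_mask, class_id=0)
```

In a pipeline of this kind, a line like this would be written once per object per frame into a text file that shares its name with the saved frame image.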
Abstract
Modern machine learning methods require large amounts of labelled data, which makes data preparation time-consuming and resource-intensive. In this paper, we describe the prototyping of a tool for annotating video and generating training datasets based on video tracking and segmentation. We examine different approaches to this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at https://github.com/lnikioffic/track-anything-annotate
