Track Anything Annotate: Video annotation and dataset generation of computer vision models

Nikita Ivanov, Mark Klimov, Dmitry Glukhikh, Tatiana Chernysheva, Igor Glukhikh

TL;DR

This paper addresses the bottleneck of creating large labeled datasets for video-based computer vision by prototyping Track Anything Annotate, a tool that couples state-of-the-art segmentation and tracking models to automate annotation. The authors compare three pipelines (OpenCV+FastSAM, FastSAM+XMem, and SAM2+XMem++) and show that combining SAM2 with XMem++ yields higher accuracy in challenging scenarios at the cost of higher latency and memory use. The prototype generates datasets in YOLO format and provides a web-based demo to illustrate practical use. The approach offers a scalable route to rapid dataset generation for long videos and complex scenes, albeit with substantial hardware requirements.
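
To make the YOLO export step concrete, the following is a minimal sketch of how one tracked binary mask could be turned into a YOLO detection label line. It assumes the common detection label layout (`class x_center y_center width height`, all normalized to the image size); the function names `mask_to_bbox` and `bbox_to_yolo_line` are hypothetical and are not taken from the authors' repository, whose actual export code may differ.

```python
from typing import List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive pixel indices


def mask_to_bbox(mask: List[List[int]]) -> Optional[BBox]:
    """Return the tight bounding box of all nonzero pixels, or None for an empty mask."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def bbox_to_yolo_line(bbox: BBox, img_w: int, img_h: int, class_id: int = 0) -> str:
    """Format a pixel bounding box as a normalized YOLO detection label line."""
    x_min, y_min, x_max, y_max = bbox
    # Pixel indices are inclusive, so a box spanning columns 1..2 is 2 pixels wide.
    box_w = x_max - x_min + 1
    box_h = y_max - y_min + 1
    x_center = x_min + box_w / 2
    y_center = y_min + box_h / 2
    return (
        f"{class_id} {x_center / img_w:.6f} {y_center / img_h:.6f} "
        f"{box_w / img_w:.6f} {box_h / img_h:.6f}"
    )


if __name__ == "__main__":
    # Hypothetical 4x4 mask with a 2x2 object in the middle of the frame.
    demo_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    box = mask_to_bbox(demo_mask)
    if box is not None:
        print(bbox_to_yolo_line(box, img_w=4, img_h=4))
```

In a full pipeline, a line like this would be written once per object per frame into a text file named after the corresponding image; frames where the tracker returns an empty mask would simply produce no line for that object.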

Abstract

Modern machine learning methods require significant amounts of labelled data, making the preparation process time-consuming and resource-intensive. In this paper, we describe the process of prototyping a tool for annotating videos and generating training datasets based on video tracking and segmentation. We examine different approaches to solving this problem, from technology selection through to final implementation. The developed prototype significantly accelerates dataset generation compared to manual annotation. All resources are available at https://github.com/lnikioffic/track-anything-annotate

Paper Structure

This paper contains 9 sections, 11 figures, 1 table.

Figures (11)

  • Figure 1: Object selection.
  • Figure 2: Object segmentation.
  • Figure 3: Segmentation of the image with the resulting mask and the image with the mask applied.
  • Figure 4: Segmentation of the image with the received and applied mask when using SAM2.
  • Figure 5: Masks obtained through FastSAM and SAM2.
  • ...and 6 more figures