Scalable Mask Annotation for Video Text Spotting

Haibin He, Jing Zhang, Mengyang Xu, Juhua Liu, Bo Du, Dacheng Tao

TL;DR

This work introduces SAMText, a scalable mask-annotation pipeline that leverages the Segment Anything Model (SAM) to convert existing video text bounding boxes into high-quality segmentation masks, addressing limitations of traditional quadrilateral annotations in video text spotting. By processing five public datasets, the authors create SAMText-9M, a large-scale resource with over 2,400 video clips and over 9 million masks, enabling mask-based training and exploration of curved-text scenarios. They provide extensive analyses of the generated masks, including IoU, CoV, and spatial distributions, and discuss promising directions such as data/model scalability and character-level mask generation. The dataset and methodology offer a foundation for improved video text detection, recognition, and segmentation, with code and data to follow.

Abstract

Video text spotting refers to localizing, recognizing, and tracking textual elements such as captions, logos, license plates, signs, and other forms of text within consecutive video frames. However, current datasets available for this task rely on quadrilateral ground truth annotations, which may result in including excessive background content and inaccurate text boundaries. Furthermore, methods trained on these datasets often produce prediction results in the form of quadrilateral boxes, which limits their ability to handle complex scenarios such as dense or curved text. To address these issues, we propose a scalable mask annotation pipeline called SAMText for video text spotting. SAMText leverages the SAM model to generate mask annotations for scene text images or video frames at scale. Using SAMText, we have created a large-scale dataset, SAMText-9M, that contains over 2,400 video clips sourced from existing datasets and over 9 million mask annotations. We have also conducted a thorough statistical analysis of the generated masks and their quality, identifying several research topics that could be further explored based on this dataset. The code and dataset will be released at https://github.com/ViTAE-Transformer/SAMText.

Paper Structure (10 sections, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of the SAMText pipeline, which builds upon SAM to generate mask annotations for scene text images or video frames at scale. The input bounding box may be sourced from existing annotations or derived from a scene text detection model.
  • Figure 2: Some visualization results of the generated masks in five datasets using the SAMText pipeline. The top row shows the scene text frames while the bottom row shows the generated masks.
  • Figure 3: The distribution of IoU between the generated masks and ground truth masks in the COCO-Text training dataset (Veit et al., 2016).
  • Figure 4: (a) The mask size distributions of the ICDAR15, RoadText-1k, LSVDT, and DSText datasets. Masks exceeding 10,000 pixels are excluded from the statistics. (b) The mask size distribution of the BOVText dataset. Masks exceeding 80,000 pixels are excluded from the statistics.
  • Figure 5: (a) The distribution of IoU between the generated masks and ground truth bounding boxes in each dataset. (b) The distribution of the coefficient of variation (CoV) of mask size for the same text instance across consecutive frames in all five datasets. CoV scores exceeding 1.0 are excluded from the statistics.
  • ...and 1 more figure
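
The two quality statistics named in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' code: the function names, the nested-list mask layout, the exclusive box corners, and the use of the population standard deviation for CoV are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def mask_box_iou(mask, box):
    """IoU between a binary mask (rows of 0/1) and an axis-aligned box.

    `box` is (x0, y0, x1, y1) in pixel indices with x1 and y1 exclusive.
    Assumption: the box lies inside the image bounds.
    """
    x0, y0, x1, y1 = box
    mask_area = sum(sum(row) for row in mask)
    box_area = max(0, x1 - x0) * max(0, y1 - y0)
    # Mask pixels that fall inside the box.
    inter = sum(
        mask[y][x]
        for y in range(max(0, y0), min(len(mask), y1))
        for x in range(max(0, x0), min(len(mask[y]), x1))
    )
    union = mask_area + box_area - inter
    return inter / union if union else 0.0


def mask_size_cov(areas):
    """Coefficient of variation (std / mean) of one text instance's mask areas.

    Assumption: population standard deviation; the paper may use another variant.
    """
    m = mean(areas)
    return pstdev(areas) / m if m else 0.0


# Hypothetical 4x4 frame with a 2x2 text mask and a slightly wider box.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
iou = mask_box_iou(mask, (1, 1, 4, 3))  # 4 shared pixels / 6 in the union
cov = mask_size_cov([8, 12])            # std 2 / mean 10
```

A mask that fills its box tightly gives an IoU near 1, while a thin or curved text instance inside a loose box gives a lower value. A CoV near 0 indicates that the mask size of a tracked instance stays stable from frame to frame.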