Char-SAM: Turning Segment Anything Model into Scene Text Segmentation Annotator with Character-level Visual Prompts

Enze Xie, Jiahao Lyu, Daiqing Wu, Huawen Shen, Yu Zhou

TL;DR

This work addresses the high cost of pixel-level scene text segmentation annotations and SAM's limitations when applied with coarse prompts. It introduces Char-SAM, a training-free pipeline comprising Character Bounding-box Refinement (CBR) to produce character-level prompts and Character Glyph Refinement (CGR) to inject glyph-based prompts, guiding SAM to generate accurate pixel-level masks. The approach yields high-quality annotations on real datasets, achieving competitive zero-shot performance on TextSeg and improving pseudo-annotated COCO-Text/MLT17 datasets, validated through extensive ablations. The method enables low-cost, scalable generation of scene text segmentation data with practical benefits for downstream tasks like erasure, editing, and recognition pipelines.

Abstract

The recent emergence of the Segment Anything Model (SAM) enables various domain-specific segmentation tasks to be tackled cost-effectively by using bounding boxes as prompts. However, in scene text segmentation, SAM cannot achieve desirable performance. Word-level bounding boxes are too coarse as prompts for characters, while character-level bounding box prompts suffer from over-segmentation and under-segmentation. In this paper, we propose an automatic annotation pipeline named Char-SAM that turns SAM into a low-cost segmentation annotator with character-level visual prompts. Specifically, leveraging existing text detection datasets with word-level bounding box annotations, we first generate finer-grained character-level bounding box prompts using the Character Bounding-box Refinement (CBR) module. Next, in the Character Glyph Refinement (CGR) module, we employ glyph information corresponding to text character categories as a new prompt to guide SAM in producing more accurate segmentation masks, addressing both over-segmentation and under-segmentation. These modules fully utilize the bbox-to-mask capability of SAM to generate high-quality text segmentation annotations automatically. Extensive experiments on TextSeg validate the effectiveness of Char-SAM. Its training-free nature also enables the generation of high-quality scene text segmentation datasets from real-world datasets such as COCO-Text and MLT17.
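
To make the difference between word-level and character-level box prompts concrete, the sketch below naively divides a word-level box into equal-width character boxes. It is only an illustration of prompt granularity, not the CBR module, whose procedure the abstract does not specify. The function name `split_word_box`, the (x_min, y_min, x_max, y_max) box format, and the equal-width assumption are all hypothetical.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def split_word_box(word_box: Box, text: str) -> List[Tuple[str, Box]]:
    """Naively split an axis-aligned word box into equal-width character boxes.

    Hypothetical helper for illustration only: real characters vary in width,
    which is one reason a refinement step is needed in practice.
    """
    x0, y0, x1, y1 = word_box
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return []
    width = (x1 - x0) / len(chars)
    return [
        (c, (x0 + i * width, y0, x0 + (i + 1) * width, y1))
        for i, c in enumerate(chars)
    ]


# One word-level annotation becomes four character-level box prompts.
boxes = split_word_box((10.0, 20.0, 50.0, 40.0), "STOP")
for char, box in boxes:
    print(char, box)
```

Each character box could then be handed to SAM as a box prompt. In the `segment_anything` package, for instance, `SamPredictor.predict` accepts a `box` argument in XYXY format. Pairing each box with its character label is what would let a glyph-based prompt be attached later.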

Paper Structure (15 sections, 3 figures, 4 tables)

This paper contains 15 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Failure cases of SAM on TextSeg (Xu et al., 2021). The yellow and red rectangles in (a) and (b) represent the word-level and character-level bounding box prompts provided to SAM, respectively. The red dashed rectangles in (c) and (d) highlight the failed segmentations. Zoom in for a better view.
  • Figure 2: The overall framework of Char-SAM, which mainly consists of the Character Bounding-box Refinement (CBR) module, the Character Glyph Refinement (CGR) module, and the SAM architecture.
  • Figure 3: Comparison of annotation quality of different scene text segmentation datasets.