Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

Maoyuan Ye; Jing Zhang; Juhua Liu; Chenyu Liu; Baocai Yin; Cong Liu; Bo Du; Dacheng Tao

TL;DR

This paper first turns SAM into a high-quality pixel-level text segmentation (TS) model through parameter-efficient fine-tuning, then uses this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset.

Abstract

The Segment Anything Model (SAM), a powerful vision foundation model pretrained on a large-scale dataset, breaks the boundaries of general segmentation and has sparked various downstream applications. This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation. Hi-SAM excels at segmentation across four hierarchies (pixel-level text, word, text-line, and paragraph) while also performing layout analysis. Specifically, we first turn SAM into a high-quality pixel-level text segmentation (TS) model through a parameter-efficient fine-tuning approach. We use this TS model to iteratively generate pixel-level text labels in a semi-automatic manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we build the end-to-end trainable Hi-SAM on the TS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation (PS) mode. In the AMG mode, Hi-SAM first segments pixel-level text foreground masks, then samples foreground points for hierarchical text mask generation, achieving layout analysis in passing. In the PS mode, Hi-SAM provides word, text-line, and paragraph masks from a single point click. Experimental results show the state-of-the-art performance of our TS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for pixel-level text segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, and 5.49% PQ and 7.39% F1 on paragraph-level layout analysis, while requiring $20\times$ fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
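To make the reported fgIOU numbers concrete, here is a minimal sketch of a foreground intersection-over-union computation for one predicted mask and one ground-truth mask. It assumes fgIOU means the IoU of the text (foreground) class. The function name `foreground_iou`, the nested-list mask representation, and the convention of returning 1.0 when both masks are empty are illustrative assumptions, not taken from the Hi-SAM code. How the paper aggregates the metric over a dataset is not stated in this section.

```python
from typing import Sequence

# A 2-D binary mask: 1 (or any truthy value) marks a text pixel.
Mask = Sequence[Sequence[int]]


def foreground_iou(pred: Mask, target: Mask) -> float:
    """Return the IoU of the foreground class for two same-sized binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1  # pixel is text in both masks
            if p or t:
                union += 1  # pixel is text in at least one mask

    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(foreground_iou(pred, target))  # 2 shared pixels out of 4 in the union
```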

Paper Structure (25 sections, 5 equations, 10 figures, 18 tables)

Introduction
Related Work
Specialist Models for Different Text Hierarchies
Adapting Vision Foundation Model for Text Tasks
Segment Anything Model and Follow-ups
Methodology
Preliminary
Overview of Hi-SAM
Feature Extraction
Pixel-level Text Segmentation
Word, Text-line, and Paragraph Segmentation
Layout Analysis
Training of Hi-SAM
Inference of Hi-SAM
Experiments
...and 10 more sections

Figures (10)

Figure 1: Hi-SAM can perform pixel-level text segmentation, word segmentation, text-line segmentation, paragraph segmentation, and layout analysis in automatic mask generation mode. Hi-SAM also supports promptable segmentation. Given a single-point click on one word, Hi-SAM predicts the corresponding word, text-line, and paragraph masks.
Figure 2: The overview of Hi-SAM. We show the automatic mask generation mode here. With the image embedding, for pixel-level text segmentation, self-prompting module generates implicit prompt tokens for the mask decoder (S-Decoder). Based on the pixel-level text mask, a certain number of foreground points are sampled and then embedded by the frozen prompt encoder. A customized hierarchical mask decoder (H-Decoder) segments word, text-line, and paragraph masks for each point prompt. Layout analysis can be achieved with the hierarchical outputs from the H-Decoder in passing.
Figure 3: The structure details of S-Decoder. 'Trans. Conv.' and 'T2I Attn.' are transposed convolution and token-to-image attention, respectively. 'LR Mask' and 'HR Mask' denote the predicted low- and high-resolution mask logits, respectively.
Figure 4: Annotation samples in HierText generated automatically by SAM-TS. Best viewed on screen, zoomed in.
Figure 5: Trainable parameter statistics of Hi-SAM with ViT-B, ViT-L, ViT-H backbones, respectively.
...and 5 more figures
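The Figure 2 caption describes sampling foreground points from the pixel-level text mask to serve as point prompts in automatic mask generation mode. The sketch below illustrates that step only. Uniform random sampling without replacement, the function name `sample_foreground_points`, the `(x, y)` point convention, and the fixed seed are all assumptions made for illustration; the sampling strategy actually used by Hi-SAM is not described in this section.

```python
import random
from typing import List, Sequence, Tuple

# A 2-D binary mask: 1 (or any truthy value) marks a text pixel.
Mask = Sequence[Sequence[int]]


def sample_foreground_points(
    mask: Mask, num_points: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Pick up to `num_points` distinct (x, y) coordinates of foreground pixels."""
    # Collect every foreground coordinate as (column, row).
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    # Assumed strategy: uniform sampling without replacement, seeded for repeatability.
    rng = random.Random(seed)
    count = min(num_points, len(foreground))
    return rng.sample(foreground, count)


text_mask = [[0, 1, 0],
             [1, 1, 0]]
for x, y in sample_foreground_points(text_mask, num_points=2):
    # Each point would then be embedded by the prompt encoder as a point prompt.
    print(f"point prompt at x={x}, y={y}")
```

In a full pipeline, each sampled point would be passed to the prompt encoder and hierarchical decoder; those components are outside the scope of this sketch.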
