ET-SAM: Efficient Point Prompt Prediction in SAM for Unified Scene Text Detection and Layout Analysis

Xike Zhang, Maoyuan Ye, Juhua Liu, Bo Du

Abstract

Previous works based on the Segment Anything Model (SAM) have achieved promising performance in unified scene text detection and layout analysis. However, their typical reliance on pixel-level text segmentation for sampling thousands of foreground points as prompts leads to unsatisfactory inference latency and limited data utilization. To address these issues, we propose ET-SAM, an Efficient framework with two decoders for unified scene Text detection and layout analysis based on SAM. Technically, we customize a lightweight point decoder that produces word heatmaps from which a few foreground points are obtained, thereby eliminating excessive point prompts and accelerating inference. Without the dependence on pixel-level segmentation, we further design a joint training strategy to leverage existing data with heterogeneous text-level annotations. Specifically, datasets with multi-level, word-level-only, and line-level-only annotations are combined in parallel into a unified training set. For these datasets, we introduce three corresponding sets of learnable task prompts in both the point decoder and the hierarchical mask decoder to mitigate discrepancies across datasets. Extensive experiments demonstrate that, compared to the previous SAM-based architecture, ET-SAM achieves about 3$\times$ inference acceleration while obtaining competitive performance on HierText, and improves the F-score by an average of 11.0% on Total-Text, CTW1500, and ICDAR15.
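
The core idea in the abstract is that a word heatmap predicted by the point decoder is reduced to a handful of point prompts, instead of sampling thousands of foreground pixels from a segmentation map. The snippet below is a minimal sketch of how such sparse points could be extracted from a heatmap; the function name, the max-pooling-based local-maximum filtering, the score threshold, and the point budget are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch: pick sparse point prompts from a predicted word heatmap.
import torch
import torch.nn.functional as F

def heatmap_to_points(heatmap: torch.Tensor, score_thr: float = 0.3,
                      kernel: int = 3, max_points: int = 100) -> torch.Tensor:
    """Pick local maxima of an (H, W) heatmap as point prompts.

    Returns an (N, 2) tensor of (x, y) coordinates with N <= max_points.
    """
    h = heatmap[None, None]                                   # (1, 1, H, W)
    # A pixel survives only if it equals the maximum of its local window
    # (a simple NMS) and exceeds the score threshold.
    pooled = F.max_pool2d(h, kernel, stride=1, padding=kernel // 2)
    peaks = (h == pooled) & (h > score_thr)
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    scores = heatmap[ys, xs]
    # Keep only the highest-scoring peaks as the sparse point prompts.
    order = scores.argsort(descending=True)[:max_points]
    return torch.stack([xs[order], ys[order]], dim=1)

# Example: a synthetic heatmap with two word centers yields two point prompts.
hm = torch.zeros(64, 64)
hm[10, 20] = 0.9
hm[40, 50] = 0.8
print(heatmap_to_points(hm))   # two (x, y) prompts: (20, 10) and (50, 40)
```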

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: (a) Compared to Hi-SAM, ET-SAM achieves $3\times$ inference acceleration by predicting sparse points as visual prompts, significantly reducing latency at the hierarchical segmentation and post-processing stages. (b) To facilitate data scalability, we design a joint training strategy to leverage samples with mixed text-level annotations.
  • Figure 2: Overview of the ET-SAM framework. Sparse points extracted from the word heatmap produced by our point decoder are used as visual prompts, thereby accelerating subsequent inference. Guided by the point prompts and level-specific mask task prompts, the HM-Decoder segments word, word-group, text-line, and paragraph masks, which are finally used to perform layout analysis via a union-find algorithm (a minimal grouping sketch follows this list).
  • Figure 3: Structure of Point Decoder. 'Trans. Conv.' and 'T2I Attn.' denote transposed convolution and token-to-image attention, respectively.
  • Figure 4: A visualization of the target word-centric heatmap.
  • Figure 5: Structure of the HM-Decoder. Four output tokens are responsible for segmentation at the word, word-group, text-line, and paragraph levels, guided by the point prompt and learnable mask task prompt tokens. 'Interpolate' indicates interpolating the spatial resolution to $384\times384$. The IoU prediction process is omitted here.
  • ...and 3 more figures
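
As noted in the Figure 2 caption, the hierarchical masks are finally grouped into a layout via a union-find algorithm. Below is a minimal sketch of one way such grouping could work: word masks are merged whenever they are mostly covered by the same higher-level (text-line or paragraph) mask. The class and function names, the coverage criterion, and the 0.5 threshold are illustrative assumptions rather than the paper's exact post-processing.

```python
# Hypothetical sketch: group word masks under shared higher-level masks via union-find.
import numpy as np

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)

def group_words(word_masks: list[np.ndarray], region_masks: list[np.ndarray],
                thr: float = 0.5) -> list[list[int]]:
    """Group indices of word masks that are mostly covered by the same region mask."""
    uf = UnionFind(len(word_masks))
    for region in region_masks:
        # Words whose area is mostly inside this line/paragraph mask.
        inside = [i for i, w in enumerate(word_masks)
                  if (w & region).sum() / max(w.sum(), 1) > thr]
        # Chain all covered words into one connected component.
        for i, j in zip(inside, inside[1:]):
            uf.union(i, j)
    groups: dict[int, list[int]] = {}
    for i in range(len(word_masks)):
        groups.setdefault(uf.find(i), []).append(i)
    return list(groups.values())
```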