EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Haiyang Yu, Teng Fu, Bin Li, Xiangyang Xue

TL;DR

EAFormer introduces edge awareness into scene text segmentation by coupling a text edge extractor with an edge-guided encoder and a lightweight MLP decoder. The method filters Canny edges with a text area mask and fuses the resulting edge cues at the earliest encoding stage via symmetric cross-attention, which improves segmentation accuracy, especially at text boundaries. The loss combines two cross-entropy terms, one for text masks and one for text areas, so training needs semantic annotations only. Experiments on six benchmarks, including re-annotated COCO_TS and MLT_S, show state-of-the-art performance with particularly strong gains in edge regions, which benefits downstream tasks such as text erasing.
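As a rough illustration of the loss described above, the following sketch sums two pixel-wise cross-entropy terms, one for the text mask and one for the text area. The function names, the binary formulation, and the weight `lam` are assumptions made for illustration; this summary does not say how the two terms are balanced.

```python
import numpy as np

def binary_cross_entropy(prob, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))

def total_loss(mask_prob, mask_gt, area_prob, area_gt, lam=1.0):
    """Text-mask term plus text-area term; lam is a hypothetical balancing weight."""
    loss_mask = binary_cross_entropy(mask_prob, mask_gt)
    loss_area = binary_cross_entropy(area_prob, area_gt)
    return loss_mask + lam * loss_area

# Toy 2x2 example with made-up predictions and labels
mask_gt = np.array([[1.0, 0.0], [0.0, 1.0]])
area_gt = np.array([[1.0, 1.0], [0.0, 1.0]])
mask_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
area_prob = np.array([[0.9, 0.8], [0.1, 0.7]])

loss = total_loss(mask_prob, mask_gt, area_prob, area_gt)
print(loss)
```

Both targets here can be derived from segmentation labels alone, which is consistent with the claim that no extra annotation types are needed.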

Abstract

Scene text segmentation aims at cropping texts from scene images, which is usually used to help generative models edit or remove texts. The existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edge of texts. Specifically, we first design a text edge extractor to detect edges and filter out edges of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly-used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method can perform better than previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our methods, we have relabeled these datasets. Through experiments, we observe that our method can achieve a higher performance improvement when more accurate annotations are used for training.
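The edge-filtering step can be sketched as follows: compute an edge map, then keep only the edge pixels that fall inside a text area. This is a minimal sketch, not the authors' implementation. To stay dependency-free it uses a thresholded gradient magnitude as a stand-in for the Canny detector named in the TL;DR, and the function names, threshold, and toy image are hypothetical.

```python
import numpy as np

def simple_edges(gray, threshold=0.2):
    """Stand-in edge detector: thresholded gradient magnitude (not Canny)."""
    gy, gx = np.gradient(gray.astype(np.float64))  # gradients along rows, then columns
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.uint8)

def filter_text_edges(edges, text_area_mask):
    """Keep only edge pixels that lie inside the text-area mask."""
    return edges * (text_area_mask > 0).astype(np.uint8)

# Toy image: two bright squares on a dark background
image = np.zeros((8, 16))
image[2:6, 2:6] = 1.0      # pretend this square is text
image[2:6, 10:14] = 1.0    # pretend this square is a non-text object

# Text-area mask covering only the left square
text_area = np.zeros((8, 16), dtype=np.uint8)
text_area[1:7, 1:7] = 1

edges = simple_edges(image)
text_edges = filter_text_edges(edges, text_area)
print(text_edges)
```

In the toy example, the edges of the right-hand square are discarded because they lie outside the text area, while the edges of the left-hand square survive.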

Paper Structure

This paper contains 16 sections, 6 equations, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Comparison of results on a downstream application (text erasing) with different text masks as input. More accurate segmentation at text edges benefits text erasing, since fewer text pixels are wrongly predicted and more background information is preserved for the inpainting model.
  • Figure 2: Feature clustering results of PGTSNet and EAFormer. The visualization indicates that PGTSNet perceives text edges poorly compared with EAFormer.
  • Figure 3: Overall structure of EAFormer. EAFormer consists of three modules: text edge extractor, edge-guided encoder and text segmentation decoder. 'SA', 'CA', and 'FFN' represent self-attention, cross-attention, and feed-forward network, respectively.
  • Figure 4: Comparison between original and modified annotations. The original datasets suffer from missing and inaccurate annotations. Training the proposed method on the re-annotated datasets makes the experimental results more convincing.
  • Figure 5: Qualitative comparison between different methods and between training with different annotations. 'OA' and 'RA' indicate training EAFormer with original annotations and re-annotations, respectively.
  • ...and 1 more figure
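To make the 'CA' block named in the Figure 3 caption more concrete, here is a minimal single-head sketch of fusing edge features with visual features by applying cross-attention in both directions. The way the two directions are combined (a sum with a residual connection), the shapes, and the absence of learned projections are simplifying assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    """Single-head scaled dot-product attention: query tokens attend to context tokens."""
    d = query.shape[-1]
    weights = softmax(query @ context.T / np.sqrt(d))
    return weights @ context

def symmetric_fusion(visual, edge):
    """Hypothetical fusion: each stream attends to the other, then results are summed.

    Assumes both streams have the same number of tokens and channels.
    """
    vis_from_edge = cross_attention(visual, edge)   # visual queries, edge keys/values
    edge_from_vis = cross_attention(edge, visual)   # edge queries, visual keys/values
    return visual + vis_from_edge + edge_from_vis

rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 8))   # 16 tokens, 8 channels
edge = rng.normal(size=(16, 8))     # edge features on the same token grid
fused = symmetric_fusion(visual, edge)
print(fused.shape)  # (16, 8)
```

A real encoder would add learned query, key, and value projections, multiple heads, and the self-attention and feed-forward blocks that the caption also lists.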