Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection

Xu Han; Junyu Gao; Chuang Yang; Yuan Yuan; Qi Wang

TL;DR

FEPE introduces a Focus Entirety Module (FEM) and a Perceive Environment Module (PEM) to augment segmentation-based arbitrary-shaped text detection with instance-level cohesion and region-level context. The model is trained with a multi-task loss $\mathcal{L} = \lambda_{1} \mathcal{L}_{k} + \lambda_{2} \mathcal{L}_{t} + \lambda_{3} \mathcal{L}_{su} + \lambda_{4} \mathcal{L}_{sc}$, which jointly supervises the kernel, text, surrounding, and scale predictions. Results on CTW1500, Total-Text, ICDAR2015, and MSRA-TD500 show that FEPE reaches state-of-the-art or near-state-of-the-art accuracy at competitive speed, supporting the claim that combining instance- and region-level signals improves robustness to noise and scale variation. The ablations examine the contributions of FEM and PEM, the choice of the surrounding-perception window size $k$, and the benefit of targeted pre-training. The approach is aimed at real-time, reliable text detection across diverse scenes and text shapes.
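The weighted sum in the loss above can be sketched in a few lines of Python. This is only an illustration of how the four terms combine: the function name `total_loss`, the argument names, and the default weights are assumptions made for the example, and the $\lambda$ values used in the paper are not stated in this summary.

```python
def total_loss(l_kernel, l_text, l_surround, l_scale,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Combine the four task losses as a weighted sum.

    l_kernel, l_text, l_surround, l_scale correspond to L_k, L_t, L_su, L_sc.
    `weights` holds (lambda_1, ..., lambda_4); the defaults are placeholders,
    not the values used by the authors.
    """
    lam1, lam2, lam3, lam4 = weights
    return (lam1 * l_kernel + lam2 * l_text
            + lam3 * l_surround + lam4 * l_scale)


# Example with made-up loss values and made-up weights.
loss = total_loss(1.0, 2.0, 3.0, 4.0, weights=(1.0, 0.5, 0.1, 0.1))
```

Because the terms are simply added, setting a weight to zero removes the corresponding supervision, which is one way an ablation over the auxiliary branches could be run.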

Abstract

Due to the diversity of scene text in aspects such as font, color, shape, and size, accurately and efficiently detecting text is still a formidable challenge. Among the various detection approaches, segmentation-based approaches have emerged as prominent contenders owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, each pixel is predicted in isolation, without pixel-feature interaction, which also degrades detection performance. To alleviate these problems, we propose a multi-information-level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts, reducing the influence of noise. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. In addition, it emphasizes scale information, enabling the model to distinguish texts of varying scales effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of a pixel, so that it perceives environment information. It treats the kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate the FEM's ability to efficiently support the model in handling texts of different scales and confirm that the PEM can assist in perceiving pixels more accurately by focusing on pixel vicinities. Comparisons show that the proposed model outperforms existing state-of-the-art approaches on four public datasets.
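The idea of assigning the same entirety information to every pixel of an instance can be illustrated with a small label-generation sketch. It assumes the ground truth is available as a 2D map of instance ids (0 for background) and uses the raw pixel area of each instance as the shared value; the function name `scale_map`, the input format, and the absence of any normalization are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter


def scale_map(instance_ids):
    """Give every pixel of an instance the area (pixel count) of that instance.

    instance_ids: 2D list of ints, 0 = background, positive ints = instance ids.
    Returns a 2D list of the same shape; background pixels stay 0.
    """
    # Count how many pixels belong to each non-background instance.
    areas = Counter(v for row in instance_ids for v in row if v != 0)
    # Broadcast each instance's area back to all of its pixels.
    return [[areas[v] if v != 0 else 0 for v in row] for row in instance_ids]


ids = [
    [1, 1, 0],
    [0, 2, 2],
    [0, 2, 0],
]
labels = scale_map(ids)  # instance 1 has area 2, instance 2 has area 3
```

All pixels of one instance receive an identical target, so a pixel whose prediction disagrees with its neighbours in the same instance is penalized, which is the cohesion effect the abstract describes.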

Paper Structure (27 sections, 16 equations, 11 figures, 10 tables, 2 algorithms)

Introduction
Related work
Regression-based methods
Connected-component-based methods
Segmentation-based methods
Method
Overall Structure
Focus Entirety Module
Perceive Environment Module
Optimization Function
Kernel Segmentation Loss
Text Segmentation Loss
Regression Loss
Experiment
Datasets
...and 12 more sections

Figures (11)

Figure 1: Illustration of multi-level information extraction in existing segmentation-based methods and in ours. (a) Existing segmentation-based methods (PSENet, DB, DB++) focus only on pixel-level information. (b) Our method additionally extracts region-level and instance-level features to suppress noise.
Figure 2: The overall framework of the proposed FEPE. During the inference stage, only the feature extraction module, feature fusion module, kernel prediction layer, and post-processing are retained; the other components can be removed. $D^\prime$, $r^\prime$, $A$, and $L^\prime$ denote the expansion distance, expansion factor, area, and perimeter of the kernel.
Figure 3: Visualization of FEM and PEM. (a) FEM focuses on the scale of each instance; pixels belonging to large-scale instances have high activation values. (b) PEM perceives the distribution of positive samples in the surroundings. The more positive samples nearby, the larger the label value. (c) The kernel regions are labeled in orange in the left image. Each instance is marked with a distinct color, and its value is the area of that instance. (d) The kernel regions are labeled in orange in the left image. The target pixel is marked in red. Its four surrounding-map values are the numbers of positive pixels in the purple, green, grey, and blue regions.
Figure 4: Comparison of the overall pipeline with those of other advanced methods, showing the hierarchy of features learned by each method.
Figure 5: The generation process of the text map, kernel map, scale map, and surrounding map used in the experiments.
...and 6 more figures
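The surrounding-map label described in the Figure 3(d) caption, four counts of positive (kernel) pixels around a target pixel, can be sketched as follows. The caption does not say how the four regions are laid out, so this sketch assumes they are the four $k \times k$ quadrants diagonally adjacent to the target pixel; the function name `surrounding_values`, the quadrant layout, and the border clipping are all assumptions for illustration.

```python
def surrounding_values(kernel_mask, y, x, k):
    """Count kernel pixels in four k-by-k regions around pixel (y, x).

    kernel_mask: 2D list of 0/1 values, 1 = kernel (positive) pixel.
    The four regions are assumed to be the top-left, top-right,
    bottom-left, and bottom-right quadrants next to the target pixel.
    Regions are clipped at the image border.
    """
    h, w = len(kernel_mask), len(kernel_mask[0])

    def count(y0, y1, x0, x1):
        # Clip the half-open ranges [y0, y1) and [x0, x1) to the image.
        y0, y1 = max(y0, 0), min(y1, h)
        x0, x1 = max(x0, 0), min(x1, w)
        return sum(kernel_mask[r][c]
                   for r in range(y0, y1) for c in range(x0, x1))

    return (
        count(y - k, y, x - k, x),                  # top-left
        count(y - k, y, x + 1, x + k + 1),          # top-right
        count(y + 1, y + k + 1, x - k, x),          # bottom-left
        count(y + 1, y + k + 1, x + 1, x + k + 1),  # bottom-right
    )


mask = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
values = surrounding_values(mask, 2, 2, 2)
```

A larger `k` lets each value summarize a wider neighbourhood at the cost of blurring local structure, which is presumably why the window size is one of the ablated choices.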
