Focus Entirety and Perceive Environment for Arbitrary-Shaped Text Detection
Xu Han, Junyu Gao, Chuang Yang, Yuan Yuan, Qi Wang
TL;DR
FEPE introduces a Focus Entirety Module (FEM) and a Perceive Environment Module (PEM) to augment segmentation-based arbitrary-shaped text detection with instance-level cohesion and region-level context. The model trains with a multi-task loss $\mathcal{L} = \lambda_{1} \mathcal{L}_{k} + \lambda_{2} \mathcal{L}_{t} + \lambda_{3} \mathcal{L}_{su} + \lambda_{4} \mathcal{L}_{sc}$, which jointly supervises the kernel, text, surrounding, and scale predictions. On CTW1500, Total-Text, ICDAR2015, and MSRA-TD500, FEPE achieves state-of-the-art or near-state-of-the-art performance at competitive speed, supporting the claim that combining instance- and region-level signals improves robustness to noise and scale variation. The ablations isolate the contributions of FEM and PEM, identify the best-performing surrounding-perception window size $k$, and show the benefit of targeted pre-training. The approach is practically relevant for real-time, reliable text detection across diverse scenes and text shapes.
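The weighted sum in the multi-task loss can be illustrated with a minimal sketch. The names `LossWeights` and `total_loss` are hypothetical, and the default weights of 1.0 are placeholders: the actual values of $\lambda_{1}$ through $\lambda_{4}$ and the form of each individual loss term are not stated in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical container for lambda_1..lambda_4 (placeholder defaults)."""
    kernel: float = 1.0       # lambda_1, weights L_k
    text: float = 1.0         # lambda_2, weights L_t
    surrounding: float = 1.0  # lambda_3, weights L_su
    scale: float = 1.0        # lambda_4, weights L_sc


def total_loss(l_k, l_t, l_su, l_sc, weights=LossWeights()):
    """Combine the four per-task losses into one scalar:
    L = lambda_1 * L_k + lambda_2 * L_t + lambda_3 * L_su + lambda_4 * L_sc
    """
    return (weights.kernel * l_k
            + weights.text * l_t
            + weights.surrounding * l_su
            + weights.scale * l_sc)
```

In a training loop, each argument would be the scalar loss produced by the corresponding prediction head, and the weights would be tuned so that no single task dominates the gradient.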
Abstract
Because scene text varies widely in font, color, shape, and size, detecting it accurately and efficiently remains a formidable challenge. Among the various detection approaches, segmentation-based methods have become prominent owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which makes them highly susceptible to noise. In addition, each pixel is predicted in isolation, without interaction between pixel features, which also degrades detection performance. To alleviate these problems, we propose an arbitrary-shaped text detector that exploits multiple levels of information and consists of a focus entirety module (FEM) and a perceive environment module (PEM). The former extracts instance-level features and adopts a top-down scheme to model texts, reducing the influence of noise. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. It also emphasizes scale information, enabling the model to distinguish texts of varying scales effectively. The latter extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of each pixel, so that the model perceives environment information. It treats kernel pixels as positive samples and helps the model differentiate text features from kernel features. Extensive experiments demonstrate that the FEM efficiently helps the model handle texts of different scales and confirm that the PEM helps the model perceive pixels more accurately by focusing on pixel vicinities. Comparisons show that the proposed model outperforms existing state-of-the-art approaches on four public datasets.
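The idea of describing a pixel by the distribution of positive (kernel) samples in its vicinity can be sketched as follows. This is only one plausible reading and an assumption, not the PEM formulation itself: the function name `surrounding_density`, the square $k \times k$ window, the zero treatment of out-of-image positions, and the use of a simple fraction are all illustrative choices that the abstract does not specify.

```python
def surrounding_density(kernel_mask, k):
    """For each pixel, return the fraction of kernel-positive pixels
    inside the k x k window centered on it.

    kernel_mask: list of rows of 0/1 values (1 = kernel pixel).
    k: odd window size. Positions outside the image count as 0
       (an assumption made for this sketch).
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    height = len(kernel_mask)
    width = len(kernel_mask[0]) if height else 0
    radius = k // 2
    density = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            positives = 0
            # Clip the window to the image; clipped-off cells add nothing.
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    positives += kernel_mask[yy][xx]
            density[y][x] = positives / (k * k)
    return density
```

With such a map, pixels deep inside a kernel get values near 1, pixels in the text border get intermediate values, and background pixels get values near 0, which is the kind of region-level signal that could help separate text features from kernel features.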
