CBNet: A Plug-and-Play Network for Segmentation-Based Scene Text Detection

Xi Zhao; Wei Feng; Zheng Zhang; Jingjing Lv; Xin Zhu; Zhangang Lin; Jinghe Hu; Jingping Shao

TL;DR

CBNet introduces a lightweight, plug-and-play framework for segmentation-based scene text detection. It improves text kernel segmentation with a global–local context module and reconstructs text boundaries through a boundary-guided expansion driven by a learnable distance map. The global context models pixel relationships across text instances, while the local context uses distance cues within each text instance to refine the segmentation; together they yield a stronger text kernel. The boundary-guided expansion uses the predicted distance map to grow kernel contours adaptively, giving a favorable accuracy–speed trade-off and simpler post-processing. Across curved, multi-oriented, and multilingual benchmarks, CBNet consistently improves performance with minimal parameter overhead, showing strong generalization and practical value for real-time text detection systems.
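
The summary does not spell out how the global context is computed. The sketch below shows one generic way to let each pixel see other text instances, in the spirit of object-contextual attention, under stated assumptions: each kernel is represented by the mask-averaged features of its pixels, every pixel attends to all kernel representations with a scaled dot-product softmax, and the attended vector is added back to the pixel feature. The function names and the fusion-by-addition step are hypothetical and are not taken from CBN's code.

```python
import math


def kernel_representations(features, labels, num_kernels):
    """Mask-average pixel features per text kernel.

    Assumption: labels[i] is 0 for background and 1..num_kernels for a kernel id.
    """
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_kernels)]
    counts = [0] * num_kernels
    for feat, lab in zip(features, labels):
        if lab > 0:
            counts[lab - 1] += 1
            for j in range(dim):
                sums[lab - 1][j] += feat[j]
    # max(n, 1) avoids division by zero for an empty kernel
    return [[s / max(n, 1) for s in row] for row, n in zip(sums, counts)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_context(features, labels, num_kernels):
    """Hypothetical global-context step: every pixel attends to all kernel
    representations (so it sees other text instances), and the attended
    vector is added to its own feature. Fusion by addition is an assumption."""
    reps = kernel_representations(features, labels, num_kernels)
    scale = 1.0 / math.sqrt(len(features[0]))
    enhanced = []
    for feat in features:
        scores = [scale * sum(a * b for a, b in zip(feat, rep)) for rep in reps]
        weights = softmax(scores)
        context = [sum(w * rep[j] for w, rep in zip(weights, reps))
                   for j in range(len(feat))]
        enhanced.append([f + c for f, c in zip(feat, context)])
    return enhanced


if __name__ == "__main__":
    # Toy 2-D features for five pixels: two kernels plus one background pixel.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]]
    labs = [1, 1, 2, 2, 0]
    for row in global_context(feats, labs, 2):
        print([round(x, 3) for x in row])
```

In the toy input, a pixel whose feature resembles kernel 1 receives most of its context from kernel 1, while the ambiguous background pixel weights both kernels equally. A learned projection before the dot product would be the usual refinement; it is omitted here to keep the sketch dependency-free.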

Abstract

Segmentation-based methods have recently become popular in scene text detection; they mainly consist of two steps: text kernel segmentation and expansion. However, the segmentation step treats each pixel independently, and it is difficult for the expansion step to achieve a favorable accuracy-speed trade-off. In this paper, we propose a Context-aware and Boundary-guided Network (CBN) to tackle these problems. In CBN, a basic text detector is first used to predict initial segmentation results. Then, we propose a context-aware module that enhances text kernel feature representations by considering both global and local contexts. Finally, we introduce a boundary-guided module that adaptively expands the enhanced text kernels using only the pixels on their contours, which yields accurate text boundaries while maintaining high speed, especially on high-resolution output maps. In particular, with a lightweight backbone, the basic detector equipped with CBN achieves state-of-the-art results on several popular benchmarks, and CBN can be plugged into several segmentation-based methods. Code is available at https://github.com/XiiZhao/cbn.pytorch.
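
The speed argument for expanding from contour pixels only follows from area-versus-perimeter scaling. The back-of-the-envelope count below assumes an axis-aligned rectangular kernel and treats the number of visited pixels as a proxy for expansion cost; real kernels are curved, and `kernel_pixel_counts` is a hypothetical helper, not part of CBN.

```python
def kernel_pixel_counts(height, width):
    """Return (filled-region pixels, boundary-ring pixels) for a height x width
    rectangular kernel. Boundary = pixels with a 4-neighbour outside the kernel."""
    area = height * width
    if min(height, width) <= 2:
        return area, area  # every pixel lies on the boundary
    return area, 2 * (height + width) - 4


if __name__ == "__main__":
    # Upscaling the output map by s multiplies the area by s**2 but the ring only by about s.
    for s in (1, 2, 4):
        area, ring = kernel_pixel_counts(8 * s, 40 * s)
        print(f"scale {s}: region={area}, contour={ring}, ratio={area / ring:.1f}")
```

For an 8×40 kernel the filled region has 320 pixels and its boundary ring 92; at 4× resolution (32×160) these become 5120 and 380, so the advantage of processing only contour pixels grows roughly linearly with the output resolution.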

Paper Structure (26 sections, 11 equations, 10 figures, 16 tables)

Introduction
Related Work
Methodology
Overall Architecture
Context-aware Text Kernel Segmentation
Global Text-contextual
Local Text-contextual
Boundary-guided Text Kernel Expansion
Loss Function
Experiments
Datasets
Implementation Details
Ablation Studies
Effectiveness of Context-aware Module
Effectiveness of Boundary-guided Module
...and 11 more sections

Figures (10)

Figure 1: Most existing segmentation-based methods contain two steps: text kernel segmentation and expansion. (a) and (b) show that considering the relationship between pixels can improve segmentation results. The color masks in (a) and (b) represent the context of pixels. (c) and (d) show that using the points on the contour to expand the text kernel adaptively achieves a favorable accuracy-speed trade-off. The red pixels in (c) and (d) represent the pixels that need to participate in the kernel expansion.
Figure 2: Detailed architecture of the proposed CBN. The context-aware module can enhance the initial text kernel segmentation result using global and local contexts. The boundary-guided module expands the enhanced text kernel with the predicted distance map. (b) illustrates the architecture of the context-aware text kernel segmentation module. "$L$" and "$G$" represent the local and global contexts respectively.
Figure 3: The procedure of boundary-guided kernel expansion algorithm. "CD" refers to the contour detection. "CE" represents the contour expansion. "RE" refers to refining the initial boundary. The yellow dashed box shows the details of contour expansion.
Figure 4: Visualization results of the basic detector with and without the GL-CAM module. "GL-CAM" means our proposed global and local context-aware module.
Figure 5: Visualization results of PAN equipped with GL-CAM by using different post-processing. "PA", "DB", and "BG" represent pixel aggregation, post-processing of DBNet, and our boundary-guided kernel expansion respectively.
...and 5 more figures
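
Figure 3 names three stages: contour detection (CD), contour expansion (CE), and boundary refinement (RE). The minimal pure-Python sketch below mirrors those stages under explicit assumptions that the text above does not confirm: the kernel is a binary mask, the distance map stores at each contour pixel how many pixels to push that point outward, the outward direction is approximated by the ray from the contour centroid, and refinement is reduced to ordering points by angle. All function names are hypothetical.

```python
import math

NEIGHBOURS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def detect_contour(mask):
    """CD (sketch): kernel pixels with a 4-neighbour outside the kernel or image."""
    h, w = len(mask), len(mask[0])
    contour = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in NEIGHBOURS:
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    contour.append((r, c))
                    break
    return contour


def expand_contour(contour, dist_map):
    """CE (sketch): move each contour pixel outward by its predicted distance.

    Assumptions: dist_map[r][c] is the expansion distance in pixels, and the
    outward direction is the ray from the contour centroid (a crude normal).
    """
    cy = sum(r for r, _ in contour) / len(contour)
    cx = sum(c for _, c in contour) / len(contour)
    expanded = []
    for r, c in contour:
        dy, dx = r - cy, c - cx
        norm = math.hypot(dy, dx) or 1.0  # guard against a point at the centroid
        d = dist_map[r][c]
        expanded.append((r + d * dy / norm, c + d * dx / norm))
    return expanded, (cy, cx)


def refine_boundary(points, center):
    """RE placeholder: order points by angle around the centroid to form a polygon."""
    cy, cx = center
    return sorted(points, key=lambda p: math.atan2(p[0] - cy, p[1] - cx))


def boundary_guided_expand(mask, dist_map):
    """Only contour pixels are processed, never the kernel interior."""
    contour = detect_contour(mask)
    expanded, center = expand_contour(contour, dist_map)
    return refine_boundary(expanded, center)


if __name__ == "__main__":
    # 3x3 kernel inside a 5x5 map, uniform predicted distance of 1 pixel.
    kernel = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)] for r in range(5)]
    distances = [[1.0] * 5 for _ in range(5)]
    for y, x in boundary_guided_expand(kernel, distances):
        print(round(y, 2), round(x, 2))
```

For strongly curved text, the centroid ray is a poor stand-in for the true outward normal; a local normal estimated from neighbouring contour points would be a natural substitute in a more faithful version.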
