TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model

Jiahao Lyu; Jin Wei; Gangyan Zeng; Zeng Li; Enze Xie; Wei Wang; Yu Zhou

TL;DR

TextBlockV2 tackles joint detection and recognition of scene text without relying on fine-grained localization. It pairs a coarse block-level detection stage with a PLM-based recognizer, and adds clustering-based text block generation and a Unified Vision-Language Mask to fuse vision and language. The method achieves competitive or state-of-the-art results on ICDAR2015, Total-Text, and SCUT-CTW1500, and demonstrates the potential of spotting without precise detection using PLMs and even LLMs. By leveraging rich language priors and a streamlined block-centric paradigm, it improves recognition in challenging scenarios such as multi-line, occluded, or reversed text.

Abstract

Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and impressive performances of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection just like human beings?", and if yes, 2) "Is text block another alternative for scene text spotting other than word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM using a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during the pre-training phase, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incomplete-detection texts. Taking advantage of the fine-tuned language model on scene recognition benchmarks and the paradigm of text block detection, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, even Large Language Models (LLMs).

Paper Structure (22 sections, 8 equations, 8 figures, 7 tables)

Introduction
Related Works
Scene Text Spotting
Detection-Oriented Scene Text Spotter
Recognition-Oriented Scene Text Spotter
Enhancing Vision Tasks with Pre-trained Language Models
Methodology
Overall Architecture
Detection Label Generation
PLM Recognition Block
Unified Visual-Language Mask
Training and Inference
Experiments and Results
Datasets
Implementation Details
...and 7 more sections

Figures (8)

Figure 1: Illustration of Precise Detection and TextBlock Detection. Precise detection aims to detect text units, such as words or phrases, as shown in the second column. Our proposed detection method, based on text blocks, reduces the difficulty of detection. The yellow arrows represent the natural reading order in Chinese.
Figure 2: The overview pipeline of TextBlockV2. The scene text image is fed into the TextBlock detection module, which is implemented with Mask R-CNN. The cropped blocks are then split into patches that serve as visual tokens. The Pre-trained Language Model acts as a scene text recognizer, extracting texts from the cropped blocks. The PLM block can use either a decoder-only or an encoder-decoder architecture.
Figure 3: Comparison of the text block generation pipeline between TextBlock and TextBlockV2. The red boxes represent semantically appropriate text blocks, while the green ones do not.
Figure 4: Comparison of three types of masks: yellow, blue, and grey blocks represent vision tokens, language tokens, and masked tokens, respectively. (a) corresponds to the typical visual mask used for bi-directional attention. (b) represents the typical causal mask utilized in language models. (c) signifies our proposed unified vision-language mask that takes into account both vision and language characteristics.
Figure 5: Comparison of two evaluation protocols. The blue boundaries are ground truth, and the yellow ones are predicted bounding boxes. This figure shows the detailed calculation of Normalized Scores (NS) and Generalized F-measure (GF). The red texts in the spotting results are incorrect.
...and 3 more figures
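
The three mask types described in the Figure 4 caption can be made concrete with a small sketch. The caption does not spell out the exact layout of the unified vision-language mask, so the code below assumes a prefix-style arrangement: vision tokens come first and attend to each other bidirectionally, while language tokens attend to every vision token and causally to earlier language tokens. The function name `unified_vl_mask` and its parameters are hypothetical and are not taken from the paper.

```python
def unified_vl_mask(num_vision: int, num_language: int) -> list[list[bool]]:
    """Build an attention mask where mask[i][j] is True if query i may attend to key j.

    Assumed layout (not confirmed by the source): vision tokens occupy
    positions [0, num_vision), language tokens follow them.
    """
    total = num_vision + num_language
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_vision:
                # Vision query: bidirectional attention over vision tokens only.
                mask[i][j] = j < num_vision
            else:
                # Language query: sees all vision tokens (j < num_vision <= i)
                # and language tokens up to and including itself (causal).
                mask[i][j] = j <= i
    return mask


if __name__ == "__main__":
    # Print a small example: 3 vision tokens followed by 2 language tokens.
    for row in unified_vl_mask(3, 2):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With three vision tokens and two language tokens, the top-left 3x3 block is fully visible (the bidirectional visual mask of panel (a)), the bottom-right 2x2 block is lower-triangular (the causal mask of panel (b)), and language rows additionally see all vision columns. Whether vision tokens may also attend to language tokens in the paper's design is not stated here, so this sketch blocks that direction.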
