VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer

Humen Zhong; Zhibo Yang; Zhaohai Li; Peng Wang; Jun Tang; Wenqing Cheng; Cong Yao

TL;DR

This work proposes VL-Reader, a scene text recognition approach that models visual and linguistic information simultaneously. Its Masked Visual-Linguistic Decoder (MVLD) leverages masked vision-language context to achieve bi-modal feature interaction.

Abstract

Text recognition is an inherent integration of vision and language, encompassing the visual texture of stroke patterns and the semantic context among character sequences. Advanced text recognition faces three key challenges: (1) an encoder capable of representing both the visual and semantic distributions; (2) a decoder that ensures alignment between vision and semantics; and (3) consistency of the framework between pre-training, when it is used, and fine-tuning. Inspired by masked autoencoding, a successful pre-training strategy in both vision and language, we propose an innovative scene text recognition approach named VL-Reader. The novelty of VL-Reader lies in the pervasive interplay between vision and language throughout the entire process. Concretely, we first introduce a Masked Visual-Linguistic Reconstruction (MVLR) objective, which aims at simultaneously modeling visual and linguistic information. Then, we design a Masked Visual-Linguistic Decoder (MVLD) to further leverage masked vision-language context and achieve bi-modal feature interaction. The architecture of VL-Reader stays the same from pre-training to fine-tuning. In the pre-training stage, VL-Reader reconstructs both masked visual and text tokens, while in the fine-tuning stage, the network reduces to reconstructing all characters from an image without any masked regions. VL-Reader achieves an average accuracy of 97.1% on six typical datasets, surpassing the SOTA by 1.1%. The improvement is even more significant on challenging datasets. The results demonstrate that a vision and language reconstructor can serve as an effective scene text recognizer.
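The two training stages described in the abstract can be illustrated with a small masking sketch. This is not the authors' implementation: the helper names `sample_mask` and `make_inputs`, the patch count of 128, and the example word are hypothetical. The masking ratios 0.75 and 0.2 are the values reported in the Figure 5 caption, and "#" is the masked-token symbol from the Figure 3 caption. Reading the fine-tuning stage as "visual ratio 0, every character hidden" is an assumption based on the abstract's wording.

```python
import random

MASK_TOKEN = "#"  # the Figure 3 caption uses "#" for a masked language token


def sample_mask(length, ratio, rng):
    """Return booleans with round(length * ratio) True entries (True = masked)."""
    n_masked = round(length * ratio)
    masked_positions = set(rng.sample(range(length), n_masked))
    return [i in masked_positions for i in range(length)]


def make_inputs(num_patches, text, r_v, r_l, rng):
    """Build a visual patch mask and a masked character string for one sample."""
    patch_mask = sample_mask(num_patches, r_v, rng)
    char_mask = sample_mask(len(text), r_l, rng)
    masked_text = "".join(MASK_TOKEN if m else c for c, m in zip(text, char_mask))
    return patch_mask, masked_text


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Pre-training: mask part of both modalities (ratios from the Figure 5 caption).
pre_patches, pre_text = make_inputs(128, "reconstruct", r_v=0.75, r_l=0.2, rng=rng)

# Fine-tuning (assumed reading): no masked image regions, all characters hidden,
# so the model has to reconstruct the whole string from the image.
ft_patches, ft_text = make_inputs(128, "reconstruct", r_v=0.0, r_l=1.0, rng=rng)
```

Under these assumptions the same input-building routine serves both stages, and only the two ratios change, which mirrors the abstract's point that the architecture stays consistent from pre-training to fine-tuning.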

Paper Structure (21 sections, 7 equations, 7 figures, 4 tables)

Introduction
Related Work
Vision-dominated Methods
Language-aware Methods
Methodology
Overall Architecture
Masked Vision-Language Reconstruction
Training and Inference
Experiments
Datasets and Implementation Details
Comparisons with State-of-the-Arts
Standard Benchmarks
More Challenging Benchmarks
Ablation Study
Effectiveness of MVLR
...and 6 more sections

Figures (7)

Figure 1: (a) Models with vision-dominated decoders mainly rely on visual context and are incapable of handling low-quality images. (b) Models with language-dominated decoders mainly rely on linguistic context and may generate semantically correct but visually incorrect predictions.
Figure 2: The comparison between different reconstruction pipelines. (a) Visual Reconstruction follows the pipeline of MAE and its decoder will be discarded in the recognizer (dashed gray box). (b) Linguistic Reconstruction utilizes a standalone language model for linguistic refinement after visual results. (c) Our Visual-Linguistic Reconstruction reconstructs both visual and linguistic information and can inherit the entire architecture.
Figure 3: Overall architecture of VL-Reader and its detailed structure for the masked visual-linguistic decoder. In the first training phase, VL-Reader is trained under the supervision of MVLR. In the second phase, VL-Reader disables the visual reconstruction task and focuses on the text recognition task only. Black patches indicate masked visual patches and "#" indicates a masked language token.
Figure 4: The generation process of the query-text attention mask $m_{q,l}$. In the second training phase, the attention mask for masking out characters is a matrix completely filled with "1"s.
Figure 5: Analysis of (a) different visual masking ratios $r_{v}$ and (b) different linguistic masking ratios $r_l$. All other parameters are fixed during training. VL-Reader reaches the highest average accuracy on six standard benchmarks around $r_v = 0.75$ and $r_l = 0.2$.
...and 2 more figures
