Multi-Modal Character Localization and Extraction for Chinese Text Recognition

Qilong Li, Chongsheng Zhang

Abstract

Scene text recognition (STR) methods have demonstrated excellent performance on English text images. However, the complex inner structures of Chinese characters and the large number of character categories make recognizing Chinese text in images challenging. Recent studies have shown that methods designed for English text recognition hit an accuracy bottleneck on Chinese text images. This raises the question: is it appropriate to apply a model developed for English to the Chinese STR task? To explore this issue, we propose a novel method named LER, which explicitly decouples the characters and recognizes each one independently while taking into account the complex inner structures of Chinese characters. LER consists of three modules: Localization, Extraction, and Recognition. First, the localization module uses multimodal information to precisely determine each character's position. Then, the extraction module separates all characters in parallel. Finally, the recognition module exploits the unique inner structures of Chinese characters to produce the text prediction. Extensive experiments on large-scale Chinese benchmarks indicate that our method significantly outperforms existing methods. Furthermore, experiments on six English benchmarks and the Union14M benchmark show that LER also performs strongly on English text recognition. Code is available at https://github.com/Pandarenlql/LER.
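
The three-stage flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that). It assumes, purely for illustration, that localization is a single dot-product attention of per-character queries over flattened visual features, that extraction is attention-weighted pooling, and that recognition is a nearest-embedding classifier. All function names, shapes, and the random inputs are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def localize(queries, visual):
    """Assumed localization: one attention map per character query."""
    scale = math.sqrt(len(visual[0]))
    return [softmax([dot(q, v) / scale for v in visual]) for q in queries]


def extract(attn_maps, visual):
    """Assumed extraction: pool one feature per character.

    Each row depends only on its own attention map, so the rows
    could be computed in parallel.
    """
    dim = len(visual[0])
    return [
        [sum(w * v[d] for w, v in zip(attn, visual)) for d in range(dim)]
        for attn in attn_maps
    ]


def recognize(char_feats, class_embeds):
    """Assumed recognition: classify each character feature independently."""
    return [
        max(range(len(class_embeds)), key=lambda c: dot(feat, class_embeds[c]))
        for feat in char_feats
    ]


def random_matrix(rng, rows, cols):
    """Random stand-in for learned features (illustration only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM, POSITIONS, MAX_LEN, NUM_CLASSES = 8, 12, 4, 5
visual = random_matrix(rng, POSITIONS, DIM)          # flattened image features
queries = random_matrix(rng, MAX_LEN, DIM)           # one query per character slot
class_embeds = random_matrix(rng, NUM_CLASSES, DIM)  # one embedding per class

attn_maps = localize(queries, visual)
char_feats = extract(attn_maps, visual)
predictions = recognize(char_feats, class_embeds)
print(predictions)
```

The point of the sketch is the data flow: once each character slot has its own feature, recognition no longer depends on the other characters in the line, which is the decoupling the abstract emphasizes.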

Paper Structure (21 sections, 7 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: (a) Autoregressive decoder and parallel decoder. (b) The forward propagation of our LER. $<s>$, $</s>$, $<p>$, $C$, and $P$ denote the start of the sentence, the end of the sentence, padding, a character, and a position, respectively. $T$ denotes the text query proposed in this paper, $A$ denotes the visual feature obtained by the localization module, and $F$ denotes the independent character feature. Capital letters (e.g., $T$) denote features, while lowercase letters (e.g., $t$) denote vectors within the features. Subscripts indicate the index of a vector (e.g., $t_i$). The blue flow denotes the visual feature obtained by the image encoder.
  • Figure 2: (a) The spatial structure of Chinese characters. (b) Illustration of character decomposition.
  • Figure 3: The proposed LER framework and IDS decoder. $N$ and $M$ denote the number of blocks in the LER network and the number of char cutter blocks in the extraction module, respectively.
  • Figure 4: Multimodal Localization Block. (a) CLIP's text feature. (b) The structure of the Multimodal Localization Block (MLB). $Vis$, $T$, $P$, $A$, and $C$ denote the visual feature, the text query, the position embedding, the localization feature, and the character, respectively. The superscript denotes the index of the MLB within the localization module.
  • Figure 5: Visualization of the character localization.
  • ...and 5 more figures
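
The captions for Figures 2 and 3 refer to character decomposition and an IDS decoder. Assuming IDS stands for the Unicode ideographic description sequence, in which operators such as ⿰ (left-right) and ⿱ (top-bottom) describe how components are arranged, a minimal sketch of recursive decomposition might look like the following. The lookup table is a tiny hand-written, hypothetical example and is not taken from the paper or its code.

```python
# Hypothetical hand-written table: character -> one-level IDS.
# ⿰ = left-right composition, ⿱ = top-bottom composition.
IDS_TABLE = {
    "好": "⿰女子",
    "明": "⿰日月",
    "想": "⿱相心",
    "相": "⿰木目",
}


def decompose(char, table=IDS_TABLE):
    """Recursively expand a character into a flat IDS string.

    Symbols missing from the table (structure operators and basic
    components) are treated as leaves and returned unchanged.
    """
    ids = table.get(char)
    if ids is None:
        return char
    return "".join(decompose(symbol, table) for symbol in ids)


for ch in "好明想":
    print(ch, decompose(ch))
```

Under this scheme a recognizer can predict a short sequence over a few hundred structure operators and components instead of one label out of thousands of character classes, which is one common motivation for decomposition-based decoders.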