Open-Vocabulary Scene Text Recognition via Pseudo-Image Labeling and Margin Loss

Xuhua Ren, Hengcan Shi, Jin Li

TL;DR

Pseudo-OCR is a novel open-vocabulary text recognition framework for recognizing out-of-vocabulary (OOV) words. It outperforms the state of the art on eight datasets and achieves first rank in the ICDAR2022 challenge.

Abstract

Scene text recognition is an important and challenging task in computer vision. However, most prior work focuses on recognizing pre-defined words, whereas real-world applications contain many out-of-vocabulary (OOV) words. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To address this, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds, which better simulates real-world applications. Second, to reduce noise in the pseudo data, we present a semantic checking mechanism that filters for semantically meaningful data. Third, we introduce a quality-aware margin loss to improve training with pseudo data. The loss includes a margin-based part that enhances classification ability and a quality-aware part that penalizes low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state of the art on eight datasets and achieves first rank in the ICDAR2022 challenge.
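
The abstract describes the quality-aware margin loss only at a high level, so the following is a minimal sketch of one possible reading, not the paper's formulation. It assumes an additive margin subtracted from the target logit and a per-sample quality score in [0, 1] used as a weight, so that low-quality samples contribute less. The function name `quality_aware_margin_loss`, the `margin` default, and the weighting scheme are all hypothetical.

```python
import numpy as np


def quality_aware_margin_loss(logits, targets, quality, margin=0.3):
    """Illustrative additive-margin cross-entropy with quality weights.

    logits:  (N, C) raw class scores for N character positions
    targets: (N,) integer class indices
    quality: (N,) scores in [0, 1]; higher means a cleaner sample (assumed)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    quality = np.asarray(quality, dtype=np.float64)
    rows = np.arange(logits.shape[0])

    # Margin part (assumed form): lower the target logit by `margin`, so the
    # correct class must beat the others by a gap to reach the same loss.
    adjusted = logits.copy()
    adjusted[rows, targets] -= margin

    # Numerically stable log-softmax over classes.
    shifted = adjusted - adjusted.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[rows, targets]

    # Quality part (assumed form): weighted mean, so samples with a low
    # quality score have less influence on the total loss.
    return float((quality * per_sample).sum() / max(quality.sum(), 1e-12))
```

With `margin=0.0` and all quality scores equal to 1, this reduces to ordinary mean cross-entropy, which makes the effect of each part easy to inspect in isolation.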

Paper Structure

This paper contains 17 sections, 3 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Our proposed method builds a pseudo label generation system using real images. For example, the model produces two pseudo labels, 'PARS' and 'PASIR', from the real 'PARIS' image. 'PARS' is an in-vocabulary (IV) word, while 'PASIR' is an OOV word. Compared with traditional synthetic images, our pseudo-labeled images are closer to real-world images.
  • Figure 2: The proposed Pseudo-OCR contains three parts: (a) a pseudo label generation module based on a character detector, image inpainting, and semantic checking to obtain pseudo labels; (b) a text recognition network with a ViT encoder and a permutation decoder to predict the word in the image; and (c) a quality-aware margin loss, including a quality indicator, to train the model. At inference, only the text recognition network is used for prediction.
  • Figure 3: Qualitative results for samples taken from various test datasets related to the OOV problem. Both context-free methods, TRBA [baek2019wrong] and CRNN [shi2016end], failed to predict certain cases accurately, possibly due to the ambiguity involved. ABINet [fang2022abinet++] had difficulty recognizing vertically oriented and rotated text. PARSeq [parseq] also misrecognized many characters. In comparison, our method achieves the best performance.
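
The label-level effect described in Figure 1 can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: it works on label strings only and omits character detection, image inpainting, and semantic checking. The helper names `pseudo_labels` and `split_by_vocabulary`, the choice of single-character deletion and two-character swap as edit operations, and the toy vocabulary are all hypothetical.

```python
from itertools import combinations


def pseudo_labels(word):
    """Return candidate labels made by deleting one character or swapping two."""
    candidates = set()
    # Deletion (assumed operation): e.g. 'PARIS' -> 'PARS'.
    for i in range(len(word)):
        candidates.add(word[:i] + word[i + 1:])
    # Swap (assumed operation): e.g. 'PARIS' -> 'PASIR'.
    for i, j in combinations(range(len(word)), 2):
        chars = list(word)
        chars[i], chars[j] = chars[j], chars[i]
        candidates.add("".join(chars))
    # Swapping two identical characters reproduces the source word; drop it.
    candidates.discard(word)
    return candidates


def split_by_vocabulary(candidates, vocabulary):
    """Separate candidates into in-vocabulary (IV) and OOV labels."""
    in_vocab = sorted(c for c in candidates if c in vocabulary)
    out_of_vocab = sorted(c for c in candidates if c not in vocabulary)
    return in_vocab, out_of_vocab


# Hypothetical toy vocabulary; a real one would come from the training set.
vocabulary = {"PARIS", "PARS", "PAIRS"}
iv, oov = split_by_vocabulary(pseudo_labels("PARIS"), vocabulary)
```

Here `iv` contains 'PARS' and `oov` contains 'PASIR', matching the two pseudo labels named in Figure 1 under this toy vocabulary.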