Masked and Permuted Implicit Context Learning for Scene Text Recognition

Xiaomeng Yang, Zhi Qiao, Jin Wei, Dongbao Yang, Yu Zhou

TL;DR

This work introduces MPSTR, a unified decoder that combines permuted language modeling (PLM) and masked language modeling (MLM) for scene text recognition (STR). It uses a ViT-based encoder with a learnable [len] token to predict word length, and a Masked and Permuted Decoder (MP-decoder) that attends to both word-context and mask-context via two MHCA modules, guided by a set of permutations and mask tokens. The training objective combines length prediction and recognition losses, and a perturbation strategy makes the model more robust to length prediction errors. Ablations show that both length supervision and mask tokens are necessary, and that word-level length prediction yields the best results. Empirically, MPSTR achieves state-of-the-art performance on Union14M-Benchmark and competitive results on standard STR benchmarks, while reducing inference latency relative to some baselines. The work also provides a cleaned benchmark release to support fair evaluation.

Abstract

Scene Text Recognition (STR) is difficult because of variations in text styles, shapes, and backgrounds. Although integrating linguistic information improves model performance, existing methods based on either permuted language modeling (PLM) or masked language modeling (MLM) have their pitfalls: PLM's autoregressive decoding lacks foresight into subsequent characters, while MLM overlooks inter-character dependencies. To address these problems, we propose a masked and permuted implicit context learning network for STR, which unifies PLM and MLM within a single decoder and inherits the advantages of both approaches. We adopt the training procedure of PLM and, to integrate MLM, incorporate word length information into the decoding process and replace the undetermined characters with mask tokens. In addition, perturbation training is employed to make the model more robust to potential length prediction errors. In our empirical evaluations, the model not only achieves superior performance on the common benchmarks but also achieves a substantial improvement of $9.1\%$ on the more challenging Union14M-Benchmark.
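To make the idea of replacing undetermined characters with mask tokens concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the word length is already known, represents the context as a plain list of strings, and uses the hypothetical helper name `masked_context`. The decoding order `order` is one arbitrary permutation chosen for illustration, and the sketch folds the word-context and mask-context into a single list for readability.

```python
MASK = "[M]"  # mask token, written as in the paper's figure captions


def masked_context(target, order, step):
    """Return the context visible at decoding step `step`.

    Positions order[:step] count as already decoded and appear as
    characters. Every other position appears as a mask token, so the
    context always spans the full word length.
    """
    known = set(order[:step])
    return [ch if i in known else MASK for i, ch in enumerate(target)]


if __name__ == "__main__":
    word = "text"
    order = [2, 0, 3, 1]  # one example permutation of the character positions
    for step in range(len(word)):
        ctx = masked_context(word, order, step)
        print(f"step {step}: predict position {order[step]} from {ctx}")
```

At every step the context has as many slots as the word has characters, which is how length information gives a permuted, autoregressive decoder a view of positions it has not yet decoded.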

Paper Structure (23 sections, 5 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Decoding procedures of PLM, MLM, and our unified language modeling. "-" in the figure represents the mask token used in decoding. The lack of global information and inadequate linguistic learning lead to misrecognition by PLM-based and MLM-based methods.
  • Figure 2: Architecture of the proposed method. $[B]$, $[E]$, $[P]$ and $[M]$ stand for the beginning-of-sequence, end-of-sequence, padding and mask tokens, respectively. The ViT-based encoder predicts the text length using the $[len]$ token. Then, as many mask tokens as the predicted length are appended. After $K$ permutation operations, the masked and permuted text is input to the decoder for the corresponding prediction.
  • Figure 3: Examples of label noise in benchmark datasets. Characters that are mislabeled or missing are highlighted in red (top), while the corrected labels are provided below.
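
The permutation step mentioned in the Figure 2 caption can be illustrated with the way permuted language modeling is commonly implemented: each permutation of the character positions induces an attention mask in which a position may only see positions that come earlier in that permutation. The sketch below shows this general technique under that assumption. It is not taken from the paper, and the function name `permutation_attention_mask` is hypothetical.

```python
def permutation_attention_mask(order):
    """Build a boolean attention mask from a decoding order.

    allowed[q][k] is True when query position q may attend to key
    position k, i.e. when k is decoded strictly before q in `order`.
    """
    n = len(order)
    rank = {pos: r for r, pos in enumerate(order)}  # position -> decoding step
    return [[rank[k] < rank[q] for k in range(n)] for q in range(n)]


if __name__ == "__main__":
    # The identity order reproduces the usual left-to-right causal mask.
    for row in permutation_attention_mask([0, 1, 2]):
        print(row)
    # A different order lets position 0 see position 2, which is decoded first.
    for row in permutation_attention_mask([2, 0, 1]):
        print(row)
```

Training over several such orders, $K$ of them in the caption's notation, exposes the decoder to contexts other than the strict left-to-right one.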