Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets
Martin Kišš, Michal Hradiš
TL;DR
The paper improves Transformer-based text recognition through masked self-supervised pre-training on large unlabeled data. It introduces two enhancements to masked label prediction: progressively increasing the masking probability during pre-training, and computing the loss on both masked and non-masked patches, aided by a $4{,}096$-class K-Means discretization over features from a 50M-line corpus. The pre-trained encoder is then integrated into an encoder-decoder architecture and fine-tuned on four annotated datasets, reducing the character error rate (CER) by up to $30\%$ relative and proving competitive with transfer learning without requiring additional labeled data. Across six model sizes and four downstream datasets, the approach yields consistent gains, particularly on smaller datasets. Final results on Bentham, Bullinger, and CATMuS Medieval are close to the state of the art in several cases, underscoring the practical value for handwritten text recognition.
Abstract
Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of the models, in some cases by up to 30% relative. It is also on par with transfer learning, without relying on extra annotated text lines.
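The two pre-training modifications can be illustrated with a minimal sketch. This is not the authors' implementation: the linear shape of the schedule, the endpoints `p_start` and `p_end`, independent per-patch masking, and the `unmasked_weight` factor are all assumptions made for illustration, as are the function names. The per-patch targets stand in for discrete class indices such as K-Means cluster IDs.

```python
import math
import random


def masking_probability(step, total_steps, p_start=0.1, p_end=0.6):
    """Assumed linear schedule: masking probability grows from p_start to p_end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def sample_mask(num_patches, p, rng):
    """Assumed masking scheme: each patch is masked independently with probability p."""
    return [rng.random() < p for _ in range(num_patches)]


def cross_entropy(logits, target):
    """Cross-entropy for one patch, computed with a numerically stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def pretraining_loss(logits_per_patch, targets, mask, unmasked_weight=0.5):
    """Weighted mean cross-entropy over masked AND non-masked patches.

    Masked patches get weight 1.0; non-masked patches get the hypothetical
    weight unmasked_weight (0.0 recovers a masked-only loss).
    """
    total, weight_sum = 0.0, 0.0
    for logits, target, is_masked in zip(logits_per_patch, targets, mask):
        w = 1.0 if is_masked else unmasked_weight
        total += w * cross_entropy(logits, target)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    p = masking_probability(step=5_000, total_steps=10_000)
    mask = sample_mask(num_patches=8, p=p, rng=rng)
    # Toy logits over 4 classes; a real model would predict over all cluster IDs.
    logits = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
    targets = [rng.randrange(4) for _ in range(8)]
    print(p, pretraining_loss(logits, targets, mask))
```

In this sketch, setting `unmasked_weight=0.0` reduces the objective to a conventional masked-only loss, which makes the effect of including non-masked patches easy to isolate in a comparison.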
