Masked Self-Supervised Pre-Training for Text Recognition Transformers on Large-Scale Datasets

Martin Kišš, Michal Hradiš

TL;DR

The paper improves Transformer-based text recognition through masked self-supervised pre-training on large unlabeled data. It introduces two changes to masked label prediction: progressively increasing the masking probability during pre-training, and computing the loss on both masked and non-masked patches. The target labels come from a $4{,}096$-class K-Means discretization of visual features extracted from a corpus of $50$M text lines. The pre-trained encoder is then integrated into an encoder-decoder architecture and fine-tuned on four annotated datasets, reducing the character error rate (CER) by up to $30\%$ relative and matching transfer learning without requiring additional labeled data. Across six model sizes and four downstream datasets, the approach yields consistent gains, particularly on smaller datasets, and the final results on Bentham, Bullinger, and CATMuS Medieval are close to the state of the art in several cases.

Abstract

Self-supervised learning has emerged as a powerful approach for leveraging large-scale unlabeled data to improve model performance in various domains. In this paper, we explore masked self-supervised pre-training for text recognition transformers. Specifically, we propose two modifications to the pre-training phase: progressively increasing the masking probability, and modifying the loss function to incorporate both masked and non-masked patches. We conduct extensive experiments using a dataset of 50M unlabeled text lines for pre-training and four differently sized annotated datasets for fine-tuning. Furthermore, we compare our pre-trained models against those trained with transfer learning, demonstrating the effectiveness of the self-supervised pre-training. In particular, pre-training consistently improves the character error rate of models, in some cases up to 30 % relatively. It is also on par with transfer learning but without relying on extra annotated text lines.
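
The two modifications named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear ramp, its endpoints `p_start` and `p_end`, and the `unmasked_weight` factor are hypothetical choices made only for illustration, since the abstract does not specify the schedule or how masked and non-masked patches are weighted. Only the number of classes ($4{,}096$) comes from the text above.

```python
import numpy as np

NUM_CLASSES = 4096  # size of the K-Means codebook stated in the TL;DR


def masking_probability(step, total_steps, p_start=0.1, p_end=0.5):
    """Masking probability that grows with training progress.

    Assumption: a linear ramp; p_start and p_end are hypothetical values.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p_start + frac * (p_end - p_start)


def sample_mask(num_patches, p, rng):
    """Boolean mask over patches; True marks a masked patch."""
    return rng.random(num_patches) < p


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def masked_prediction_loss(logits, labels, mask, unmasked_weight=0.5):
    """Weighted cross-entropy over all patches.

    Masked patches get weight 1.0 and non-masked patches get
    unmasked_weight (a hypothetical factor, not taken from the paper).
    Note: with unmasked_weight=0 and an all-False mask the weights sum
    to zero, so callers should avoid that combination.
    """
    nll = -log_softmax(logits)[np.arange(len(labels)), labels]
    weights = np.where(mask, 1.0, unmasked_weight)
    return float((weights * nll).sum() / weights.sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_patches = 64
    p = masking_probability(step=5_000, total_steps=10_000)
    mask = sample_mask(num_patches, p, rng)
    logits = rng.normal(size=(num_patches, NUM_CLASSES))  # stand-in for encoder outputs
    labels = rng.integers(0, NUM_CLASSES, size=num_patches)  # stand-in for K-Means labels
    print(p, masked_prediction_loss(logits, labels, mask))
```

Setting `unmasked_weight=0.0` recovers a loss computed on masked patches only, which makes it easy to compare the two variants in a toy setting.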

Paper Structure

This paper contains 24 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of our self-supervised pre-training. First, we pre-train the encoder using the fitted K-Means and masked label prediction. Then, we use the encoder in an encoder-decoder model and fine-tune it on an annotated dataset.
  • Figure 2: Details of the masked self-supervised pre-training. We process text lines using an existing VGG-like model to extract visual features and fit K-Means on them (Figure \ref{fig:method1}). During pre-training, we employ the VGG-like model and the fitted K-Means to generate discrete labels for a given text line. Random patches of the text line are masked, and the encoder is trained to predict the labels of the masked parts (Figure \ref{fig:method2}).
  • Figure 3: Examples of text lines from datasets.
  • Figure 4: Visualization of identically encoded patches. Each row shows a group of patches that share the same trigram (label triplet).
  • Figure 5: Schema of our models consisting of an encoder, a decoder, and an adapter that transforms the encoder outputs for the decoder. The purple circles represent 1D sine-based positional encoding.
  • ...and 1 more figures