General Detection-based Text Line Recognition

Raphael Baena, Syrine Kalleli, Mathieu Aubry

TL;DR

This paper introduces a general detection-based approach to text line recognition, whether printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered characters. It performs well across a wide range of scripts that are usually tackled with specialized approaches.

Abstract

We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR), with Latin, Chinese, or ciphered characters. Detection-based approaches have until now been largely discarded for HTR because reading characters separately is often challenging, and character-level annotation is difficult and expensive. We overcome these challenges thanks to three main insights: (i) synthetic pre-training with sufficiently diverse data enables learning reasonable character localization for any script; (ii) modern transformer-based detectors can jointly detect a large number of instances, and, if trained with an adequate masking strategy, leverage consistency between the different detections; (iii) once a pre-trained detection model with approximate character localization is available, it is possible to fine-tune it with line-level annotation on real data, even with a different alphabet. Our approach, dubbed DTLR, builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding, predicting character values one by one, while we treat a complete line in parallel. Remarkably, we demonstrate good performance on a large range of scripts, usually tackled with specialized approaches. In particular, we improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets. Our code and models are available at https://github.com/raphael-baena/DTLR.
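
To make the contrast with autoregressive decoding concrete, here is a minimal sketch of how a text line could be read out from a set of character detections that were all predicted in parallel. It is not the DTLR implementation: the `Detection` class, the `detections_to_text` helper, the 0.5 confidence threshold, and the left-to-right reading order are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One hypothetical character detection (illustrative, not the DTLR data format)."""
    x_center: float  # horizontal center of the bounding box, in pixels
    char: str        # most likely character for this box (may be a space)
    score: float     # probability assigned to that character


def detections_to_text(detections, min_score=0.5):
    """Read out a line: drop weak detections, sort left to right, concatenate.

    Assumes a horizontal, left-to-right script; min_score is an arbitrary threshold.
    """
    kept = [d for d in detections if d.score >= min_score]
    kept.sort(key=lambda d: d.x_center)
    return "".join(d.char for d in kept)


# Detections come as an unordered set because they are predicted jointly,
# not one character after another.
example = [
    Detection(52.0, "t", 0.97),
    Detection(10.0, "c", 0.97),
    Detection(31.0, "a", 0.91),
    Detection(40.0, "x", 0.12),  # low-confidence spurious box, filtered out
]
print(detections_to_text(example))  # cat
```

The point of the sketch is that ordering is recovered from box positions after the fact, so no character prediction has to wait for the previous one. A real system would likely also need to handle overlapping boxes and other reading directions.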

Paper Structure (29 sections, 4 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Our model is general and can be used on diverse datasets, including challenging handwritten script, Chinese script and ciphers. From left to right and top to bottom, we show results on the Google1000, IAM, READ, RIMES, CASIA, and Cipher datasets.
  • Figure 2: Architecture. Our architecture is based on DINO-DETR. Given CNN image features as input, a transformer encoder predicts initial anchors and tokens, which a transformer decoder uses to predict, for each token, a character bounding box and a probability for each character in the alphabet, including white space.
  • Figure 3: Samples from our synthetic datasets without (left) and with masking (right).
  • Figure 4: Failure cases. For the line(s) with the highest error on each dataset (Google1000, IAM, RIMES, READ, CASIA, and Copiale), we show our detections, the predicted text (P), and the ground-truth text (GT). Best seen in color.
  • Figure 5: Challenging Data. Qualitative examples from the READ dataset with slightly slanted lines, degraded characters and translucent paper. Spaces are omitted for better visualization.
  • ...and 3 more figures