Meta-DAN: towards an efficient prediction strategy for page-level handwritten text recognition

Denis Coquenet

TL;DR

Meta-DAN addresses the slow inference of end-to-end page-level handwritten text recognition by introducing windowed queries and multi-token predictions, allowing multiple tokens to be predicted per decoding step and leveraging near-future context. The framework unifies MT-DAN and W-DAN into Meta-DAN, controlled by two hyperparameters, and demonstrates state-of-the-art average CER across 10 diverse datasets without external data or language models. A dynamic prediction policy further balances speed and accuracy, and extensive multilingual experiments show the value of shared representations across related languages. The approach yields significant inference-time gains while improving language modeling within the decoder, offering a versatile, scalable solution for fast, accurate page-level HTR. Overall, Meta-DAN provides a flexible, robust decoding paradigm with strong empirical performance and broad applicability to autoregressive transformer-based recognition systems.

Abstract

Recent advances in text recognition have led to a paradigm shift for page-level recognition, from multi-step segmentation-based approaches to end-to-end attention-based ones. However, the naïve character-level autoregressive decoding process results in long prediction times: it requires several seconds to process a single page image on a modern GPU. We propose the Meta Document Attention Network (Meta-DAN) as a novel decoding strategy to reduce the prediction time while enabling better context modeling. It relies on two main components: windowed queries, which process several transformer queries together and extend the modeled context to the near future; and multi-token predictions, whose goal is to predict several tokens per query instead of only the next one. We evaluate the proposed approach on 10 full-page handwritten datasets and demonstrate state-of-the-art results on average in terms of character error rate. Source code and weights of trained models are available at https://github.com/FactoDeepLearning/meta_dan.

Paper Structure

This paper contains 30 sections, 12 equations, 7 figures, 15 tables.

Figures (7)

  • Figure 1: DAN architecture. Encoding stage: 2D features $\boldsymbol{f}$ are extracted from the input image through a Fully Convolutional Network (FCN). Features are enhanced with 2D positional encoding and flattened for transformer needs. Decoding stage: at each iteration $t$, a query sequence $\boldsymbol{q}^t$ is generated from the previous predictions $\hat{\boldsymbol{y}}^t$ and associated to the image features through transformer's attention mechanisms to compute the new prediction $\hat{\boldsymbol{y}}_t$.
  • Figure 2: Comparison of prediction strategies for the target sequence "A\nsimple\nexample.". Approaches can be compared under three aspects: 1) the number of predictions per query (single for DAN, Faster DAN and W-DAN); 2) the number of queries processed at once per decoding iteration (single for DAN and MT-DAN); 3) the available context (beginning of each line for Faster DAN; full past for the others). Meta-DAN combines the advantages on each of these points: it processes multiple queries at once, each one leading to several token predictions, and benefits from full past context.
  • Figure 3: Comparison of attention context for the target sequence "A\nsimple\nexample.". The blue color indicates which tokens can be attended (columns) when processing a given token (row). Grey cells correspond to ignored tokens (i.e., tokens that have not yet been predicted). For the Faster DAN, the red color corresponds to the first pass, and the blue color to the second pass.
  • Figure 4: Evolution of the prediction time with respect to the window size $w$ for the W-DAN and to the number of token heads $m$ for the MT-DAN on the test set of IAM and BRESSAY datasets. The prediction time decreases sharply at first as $m$ or $w$ increases, then more gradually as the computational load of the encoder accounts for a growing share of the overall inference process.
  • Figure 5: Impact of the CER threshold for dynamic decoding of the MT-DAN on the test set of IAM and BRESSAY.
  • ...and 2 more figures
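
The trend described in the captions of Figures 2 and 4 can be illustrated with a back-of-the-envelope cost model. The sketch below is not taken from the paper or its repository: the function names, the fixed `encoder_time` and `step_time` values, and the assumption that every decoding iteration emits exactly $w \times m$ tokens at a constant cost are all hypothetical simplifications chosen for illustration.

```python
import math


def decoding_iterations(num_tokens: int, window_size: int = 1,
                        tokens_per_query: int = 1) -> int:
    """Count decoder iterations, ASSUMING each iteration emits exactly
    window_size * tokens_per_query tokens (an idealised simplification)."""
    if num_tokens < 0 or window_size < 1 or tokens_per_query < 1:
        raise ValueError("num_tokens must be >= 0; w and m must be >= 1")
    return math.ceil(num_tokens / (window_size * tokens_per_query))


def estimated_time(num_tokens: int, window_size: int = 1,
                   tokens_per_query: int = 1,
                   encoder_time: float = 0.1,
                   step_time: float = 0.002) -> float:
    """Toy prediction-time model: one encoder pass plus a constant cost per
    decoder iteration. Both timing values are made-up placeholders in
    seconds, not measurements from the paper."""
    steps = decoding_iterations(num_tokens, window_size, tokens_per_query)
    return encoder_time + steps * step_time


if __name__ == "__main__":
    page_length = 1000  # hypothetical number of characters on a page
    for w in (1, 2, 4, 8, 16):
        t = estimated_time(page_length, window_size=w)
        print(f"w={w:2d}  iterations={decoding_iterations(page_length, w):4d}  "
              f"time={t:.3f}s")
```

Under these assumptions, setting `window_size=1` and `tokens_per_query=1` corresponds to single-token, single-query decoding, varying only one of the two parameters mimics W-DAN or MT-DAN, and varying both mimics Meta-DAN. Because the encoder term is constant, each doubling of $w$ or $m$ saves less time than the previous one, which matches the flattening described for Figure 4.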