HAND: Hierarchical Attention Network for Multi-Scale Handwritten Document Recognition and Layout Analysis

Mohammed Hamdan, Abderrahmane Rahiche, Mohamed Cheriet

TL;DR

HAND is a unified, end-to-end framework for handwritten document recognition and layout analysis that scales from single lines to triple-column pages. The architecture couples a dual-path convolutional encoder with a transformer-based decoder that uses memory-augmented and sparse attention, supported by a Multi-Scale Adaptive Processing (MSAP) framework and curriculum learning to handle diverse historical documents efficiently. A domain-adaptive mT5 post-processing stage further refines the outputs. On the READ 2016 dataset, HAND reports state-of-the-art text and layout results, with up to 59.8% lower CER at line level and 31.2% at page level, while keeping the model at 5.60M parameters. These results are relevant to multi-page historical document processing and digital archival workflows.

Abstract

Handwritten document recognition (HDR) is one of the most challenging tasks in computer vision, owing to the varied writing styles and complex layouts inherent in handwritten texts. Traditionally, the problem has been approached as two separate tasks, handwritten text recognition and layout analysis, and existing methods have struggled to integrate the two effectively. This paper introduces HAND (Hierarchical Attention Network for Multi-Scale Document), a novel end-to-end, segmentation-free architecture for simultaneous text recognition and layout analysis. The model's key components are an advanced convolutional encoder integrating Gated Depth-wise Separable and Octave Convolutions for robust feature extraction, a Multi-Scale Adaptive Processing (MSAP) framework that dynamically adjusts to document complexity, and a hierarchical attention decoder with memory-augmented and sparse attention mechanisms. These components enable the model to scale effectively from single-line to triple-column pages while maintaining computational efficiency. Additionally, HAND adopts curriculum learning across five complexity levels. To improve recognition accuracy on complex ancient manuscripts, we fine-tune and integrate a Domain-Adaptive Pre-trained mT5 model for post-processing refinement. Extensive evaluations on the READ 2016 dataset demonstrate the superior performance of HAND, which achieves up to a 59.8% reduction in character error rate (CER) for line-level recognition and 31.2% for page-level recognition compared to state-of-the-art methods. The model also maintains a compact size of 5.60M parameters while establishing new benchmarks in both text recognition and layout analysis. Source code and pre-trained models are available at: https://github.com/MHHamdan/HAND.
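The abstract reports its gains as relative reductions in CER. As a rough illustration of how such a figure is typically computed, the sketch below derives CER from the Levenshtein edit distance and then the relative reduction between two systems. The function names and the sample error rates are hypothetical, chosen for illustration only; they are not taken from the HAND code base or from the paper's result tables.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error rate of `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical error rates, not values from the paper.
    baseline_cer, new_cer = 0.10, 0.04
    print(f"relative CER reduction: {relative_reduction(baseline_cer, new_cer):.1%}")
```

Under this definition, a reduction of 59.8% means the new CER is 40.2% of the baseline CER, not that the CER dropped by 59.8 percentage points.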

Paper Structure

This paper contains 47 sections, 52 equations, 7 figures, 8 tables, 4 algorithms.

Figures (7)

  • Figure 2: Document recognition complexity across multiple scales: from line-level to triple-page documents. (a) Line level, (b) Paragraph level, (c) Single-page document, (d) Double-page document, (e) Triple-page document. Images are from the READ 2016 dataset.
  • Figure 3: Overview of the HAND architecture: HAND uses convolutional layers as the encoder for spatial feature extraction and transformer decoder layers as the decoder for sequential prediction.
  • Figure 4: An example of handwritten text recognition using HAND, with (c) and without post-processing (d).
  • Figure 5: Extended hierarchical structure of a triple-column document. Arrows indicate relationships among nodes: top-bottom indicates the hierarchical structure while left-right indicates the reading order.
  • Figure : (a) document hierarchical structures
  • ...and 2 more figures