RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition

Jinhui Zheng, Zhiquan Liu, Yain-Whar Si, Jianqing Li, Xinyuan Zhang, Xiaofan Li, Haozhi Huang, Xueyuan Gong

TL;DR

This paper addresses handwritten paragraph text recognition (HPTR) and identifies a limitation of the Vertical Attention Network (VAN): its vertical attention module (VAM) is a single-branch module. It introduces the Re-parameterizing Vertical Attention Fusion Module (RVAFM), which uses dual-parameter layers to give the module a multi-branch structure during training, then applies Re-parameterization Fusion (RF) to convert it into a single-branch module for inference without accuracy loss. On the IAM paragraph-level dataset, the resulting RVAN system achieves $CER=4.44\%$ and $WER=14.37\%$, outperforming VAN and matching or surpassing several state-of-the-art HPTR methods in CER while delivering competitive WER, with inference slightly faster than VAN. Overall, the approach offers a practical way to increase learning capacity during training without sacrificing deployment efficiency, and it opens avenues for dynamic training strategies using dual-parameter configurations.

Abstract

Handwritten Paragraph Text Recognition (HPTR) is a challenging task in Computer Vision, requiring the transformation of a paragraph text image, rich in handwritten text, into text encoding sequences. One of the most advanced models for this task is the Vertical Attention Network (VAN), which uses a Vertical Attention Module (VAM) to implicitly segment paragraph text images into text lines, thereby reducing the difficulty of the recognition task. However, from a network structure perspective, VAM is a single-branch module, which learns less effectively than multi-branch modules. In this paper, we propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM), which incorporates structural re-parameterization techniques. RVAFM decouples the structure of the module between the training and inference stages. During training, it uses a multi-branch structure for more effective learning, and during inference, it uses a single-branch structure for faster processing. The features learned by the multi-branch structure are fused into the single-branch structure through a special fusion method named Re-parameterization Fusion (RF) without any loss of information. As a result, we achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set. Additionally, the inference speed is slightly faster than that of VAN.
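
The fusion step described above rests on the additivity of linear operations: two parallel linear branches whose outputs are summed compute the same function as one linear layer whose weights and biases are the sums of the branch parameters. The following is a minimal sketch of that idea for plain matrix multiplication, not the paper's RVAFM or RF implementation. All names (`linear`, `train_forward`, `fuse`, `random_branch`) and the shapes are hypothetical, chosen only for illustration, and the layers actually fused in RVAFM may differ.

```python
import random


def linear(x, w, b):
    """Compute y = x @ w + b for x of shape (n, d_in) and w of shape (d_in, d_out)."""
    return [
        [sum(xi * w[k][j] for k, xi in enumerate(row)) + b[j] for j in range(len(b))]
        for row in x
    ]


def train_forward(x, branch_a, branch_b):
    """Training-time form: two parallel linear branches, outputs summed."""
    ya = linear(x, *branch_a)
    yb = linear(x, *branch_b)
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ya, yb)]


def fuse(branch_a, branch_b):
    """Inference-time form: add the branch parameters element-wise."""
    (w_a, b_a), (w_b, b_b) = branch_a, branch_b
    w = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(w_a, w_b)]
    b = [p + q for p, q in zip(b_a, b_b)]
    return w, b


def random_branch(rng, d_in, d_out):
    """Hypothetical helper: one branch with random weights and biases."""
    w = [[rng.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
    b = [rng.uniform(-1, 1) for _ in range(d_out)]
    return w, b


rng = random.Random(0)
x = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
branch_a = random_branch(rng, 8, 3)
branch_b = random_branch(rng, 8, 3)

y_train = train_forward(x, branch_a, branch_b)   # two branches
y_infer = linear(x, *fuse(branch_a, branch_b))   # one fused branch

# Largest element-wise gap between the two forms; only rounding error remains.
max_diff = max(
    abs(p - q) for ra, rb in zip(y_train, y_infer) for p, q in zip(ra, rb)
)
print(max_diff)
```

In exact arithmetic the two forms agree exactly, which is the sense in which the fusion loses no information. In floating point they agree up to rounding error. The fused form does one matrix multiplication instead of two, which is consistent with the reported speed-up at inference.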

Paper Structure

This paper contains 16 sections, 12 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: The structure of RVAFM. (During the training process, the Training Stage RVAFM structure is used, which employs dual-parameter layers. After training is completed, we fuse the dual-parameter layers into single-parameter layers through structural re-parameterization, resulting in the Inference Stage RVAFM structure, which is used for inference.)
  • Figure 2: Visualization of the Additivity Distributive Property of Convolution and Matrix Multiplication
  • Figure 3: A sample image from the IAM paragraph-level dataset
  • Figure 4: The training curves of RVAN models with different $C_{u}$ values
  • Figure 5: The training curves of RVAN models with different NSL