RVAFM: Re-parameterizing Vertical Attention Fusion Module for Handwritten Paragraph Text Recognition
Jinhui Zheng, Zhiquan Liu, Yain-Whar Si, Jianqing Li, Xinyuan Zhang, Xiaofan Li, Haozhi Huang, Xueyuan Gong
TL;DR
This paper addresses handwritten paragraph text recognition (HPTR) by identifying a limitation of the Vertical Attention Network (VAN): its vertical attention module (VAM) is a single-branch module. It introduces the Re-parameterizing Vertical Attention Fusion Module (RVAFM), which uses dual-parameter layers to provide a multi-branch structure during training and applies Re-parameterization Fusion (RF) to collapse that structure into a single branch for faster inference without accuracy loss. On the IAM paragraph-level dataset, the resulting RVAN system achieves a character error rate (CER) of 4.44% and a word error rate (WER) of 14.37%, outperforming VAN and matching or surpassing several state-of-the-art HPTR methods in CER while delivering competitive WER and slightly faster inference than VAN. Overall, the approach offers a practical way to increase learning capacity during training without sacrificing deployment efficiency, and it opens avenues for dynamic training strategies based on dual-parameter configurations.
Abstract
Handwritten Paragraph Text Recognition (HPTR) is a challenging task in Computer Vision that requires transcribing an image of a handwritten paragraph into a sequence of text encodings. One of the most advanced models for this task is the Vertical Attention Network (VAN), which uses a Vertical Attention Module (VAM) to implicitly segment paragraph images into text lines, thereby reducing the difficulty of the recognition task. However, from a network structure perspective, VAM is a single-branch module, which is less effective at learning than multi-branch modules. In this paper, we propose a new module, named Re-parameterizing Vertical Attention Fusion Module (RVAFM), which incorporates structural re-parameterization techniques. RVAFM decouples the structure of the module between the training and inference stages. During training, it uses a multi-branch structure for more effective learning, and during inference, it uses a single-branch structure for faster processing. The features learned by the multi-branch structure are fused into the single-branch structure through a special fusion method named Re-parameterization Fusion (RF) without any loss of information. As a result, we achieve a Character Error Rate (CER) of 4.44% and a Word Error Rate (WER) of 14.37% on the IAM paragraph-level test set. Additionally, inference is slightly faster than with VAN.
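The general idea behind structural re-parameterization can be illustrated with a minimal sketch. It assumes the simplest possible case, two parallel linear branches whose outputs are summed, and it is not the paper's RF procedure or the actual RVAFM layers, whose details are not given here. All function names (`linear`, `multi_branch`, `fuse`) are hypothetical and chosen for illustration only. Because each branch is linear, summing the branch outputs equals applying a single layer whose weights and biases are the sums of the branch parameters.

```python
import random


def linear(weight, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def multi_branch(branches, x):
    """Training-time form: sum the outputs of parallel linear branches."""
    outputs = [linear(w, b, x) for w, b in branches]
    return [sum(vals) for vals in zip(*outputs)]


def fuse(branches):
    """Fold parallel linear branches into one (weight, bias) pair.

    Assumption: branches are purely linear and combined by addition,
    so element-wise summation of parameters is an exact fusion.
    """
    weights = [w for w, _ in branches]
    biases = [b for _, b in branches]
    fused_w = [[sum(col) for col in zip(*rows)] for rows in zip(*weights)]
    fused_b = [sum(vals) for vals in zip(*biases)]
    return fused_w, fused_b


rng = random.Random(0)


def rand_branch(out_dim, in_dim):
    """Create one branch with random weights and biases (illustrative only)."""
    w = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return w, b


branches = [rand_branch(3, 4), rand_branch(3, 4)]
x = [rng.uniform(-1, 1) for _ in range(4)]

y_train = multi_branch(branches, x)       # two branches, as during training
fused_w, fused_b = fuse(branches)
y_infer = linear(fused_w, fused_b, x)     # one branch, as during inference

# The two forms agree up to floating-point rounding.
max_diff = max(abs(a - b) for a, b in zip(y_train, y_infer))
```

In this toy setting the fused layer does half the multiply-accumulate work of the two-branch form while producing the same output, which is the kind of training and inference decoupling the abstract describes. Real modules typically also contain normalization or convolution layers, which need their own fusion rules.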
