NAMER: Non-Autoregressive Modeling for Handwritten Mathematical Expression Recognition

Chenyu Liu, Jia Pan, Jinshui Hu, Baocai Yin, Bing Yin, Mingjun Chen, Cong Liu, Jun Du, Qingfeng Liu

TL;DR

NAMER addresses the limitations of autoregressive handwritten mathematical expression recognition (HMER) by introducing a bottom-up non-autoregressive pipeline. It combines a Visual Aware Tokenizer (VAT), which tokenizes visible symbols and local relations, with a Parallel Graph Decoder (PGD), which revises tokens and predicts left/right connectivities in parallel to form a DAG. A bipartite-matching-based dynamic assignment provides training targets for the VAT, while the PGD jointly learns token corrections and structural edges, enabling fast, end-to-end DAG decoding. Results on CROHME 2014/2016/2019 and HME100K show state-of-the-art ExpRate together with speedups of 13.7x in decoding time and 6.7x in overall FPS, which makes NAMER practical for real-time HMER and complex symbol layouts.
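The bipartite-matching step mentioned above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's formulation: the cost function (`match_cost`), the prediction and target dictionaries, the position weighting, and the exhaustive search are all hypothetical. A practical implementation would more likely use the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`, instead of enumerating permutations.

```python
from itertools import permutations


def match_cost(pred, target, pos_weight=1.0):
    # Hypothetical cost: penalise low class probability and large positional offset.
    class_cost = 1.0 - pred["probs"].get(target["symbol"], 0.0)
    dx = pred["pos"][0] - target["pos"][0]
    dy = pred["pos"][1] - target["pos"][1]
    return class_cost + pos_weight * (dx * dx + dy * dy) ** 0.5


def assign_targets(preds, targets):
    # Exhaustive minimum-cost one-to-one matching; only practical for tiny inputs.
    if len(targets) > len(preds):
        raise ValueError("need at least as many predictions as targets")
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(preds)), len(targets)):
        # perm[t] is the prediction slot assigned to target t.
        cost = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    # Map each matched prediction slot to the index of its training target.
    return {p: t for t, p in enumerate(best_perm)}


# Toy example (made-up numbers): three coarse predictions for the expression x^2.
preds = [
    {"probs": {"x": 0.9, "2": 0.1}, "pos": (0.10, 0.50)},
    {"probs": {"2": 0.8, "x": 0.2}, "pos": (0.30, 0.30)},
    {"probs": {"^": 0.7}, "pos": (0.25, 0.35)},
]
targets = [
    {"symbol": "x", "pos": (0.10, 0.50)},
    {"symbol": "^", "pos": (0.20, 0.40)},
    {"symbol": "2", "pos": (0.30, 0.30)},
]
assignment = assign_targets(preds, targets)
```

In this toy case, prediction slot 0 is matched to target "x", slot 1 to "2", and slot 2 to "^". The point of a dynamic assignment of this kind is that the target for each slot is chosen from the current predictions rather than fixed in advance.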

Abstract

Recently, Handwritten Mathematical Expression Recognition (HMER) has gained considerable attention in pattern recognition for its diverse applications in document understanding. Current methods typically approach HMER as an image-to-sequence generation task within an autoregressive (AR) encoder-decoder framework. However, these approaches suffer from several drawbacks: 1) a lack of overall language context, limiting information utilization beyond the current decoding step; 2) error accumulation during AR decoding; and 3) slow decoding speed. To tackle these problems, this paper makes the first attempt to build a novel bottom-up Non-AutoRegressive Modeling approach for HMER, called NAMER. NAMER comprises a Visual Aware Tokenizer (VAT) and a Parallel Graph Decoder (PGD). Initially, the VAT tokenizes visible symbols and local relations at a coarse level. Subsequently, the PGD refines all tokens and establishes connectivities in parallel, leveraging comprehensive visual and linguistic contexts. Experiments on the CROHME 2014/2016/2019 and HME100K datasets demonstrate that NAMER not only outperforms the current state-of-the-art (SOTA) methods on ExpRate by 1.93%/2.35%/1.49%/0.62%, respectively, but also achieves speedups of 13.7x in decoding time and 6.7x in overall FPS, demonstrating both the effectiveness and the efficiency of NAMER.
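To make the idea of decoding a token graph into markup more concrete, here is a small sketch of serialising a graph of tokens and connectivities into LaTeX. The graph encoding is an assumption for illustration only: nodes are tokens, edges are (source, destination) index pairs, a successor that is a relation token ("^" or "_") opens a braced sub-expression, and at most one other successor continues the baseline. The sketch also assumes a tree-shaped, acyclic graph; the structure the paper's PGD actually produces may differ.

```python
RELATION_TOKENS = {"^", "_"}  # assumed set of local relation tokens


def to_latex(tokens, edges, start=0):
    # Build a successor list from (source, destination) edge pairs.
    succ = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)

    def walk(node):
        out = [tokens[node]]
        children = succ.get(node, [])
        branches = [n for n in children if tokens[n] in RELATION_TOKENS]
        baseline = [n for n in children if tokens[n] not in RELATION_TOKENS]
        for rel in branches:
            # A relation token wraps everything reachable from it in braces.
            inner = "".join(walk(child) for child in succ.get(rel, []))
            out.append(f"{tokens[rel]}{{{inner}}}")
        if baseline:
            # Continue along the baseline with the first non-relation successor.
            out.append(walk(baseline[0]))
        return "".join(out)

    return walk(start)


# Hypothetical graph for the expression x^{2}+1.
tokens = ["x", "^", "2", "+", "1"]
edges = [(0, 1), (1, 2), (0, 3), (3, 4)]
latex = to_latex(tokens, edges)
```

For this toy graph the result is the string `x^{2}+1`. Because every node's successors are known up front, the serialisation is a plain traversal and needs no step-by-step token generation.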

Paper Structure (24 sections, 10 equations, 6 figures, 9 tables)

This paper contains 24 sections, 10 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Comparison between existing AR-based HMER methods and the proposed NAMER. All processes in NAMER operate in a NAR manner.
  • Figure 2: Overall framework of the proposed method. NAMER consists of an encoder, a Visual Aware Tokenizer (VAT), and a Parallel Graph Decoder (PGD). For an input image, the VAT first predicts all visible symbols and local relations at a coarse level; the PGD then revises these tokens and establishes the connectivity between them. The resulting DAG is then converted to other formats such as LaTeX. All modules operate in a NAR manner.
  • Figure 3: Feasibility analysis of VAT. (a) Visual patterns of local relation tokens: "^" is marked with a red rectangle and "_" with a blue one. (b) Coarse visual tokens on an HME image. Although all locations are imprecise, all three cases can still lead to a correct final recognition.
  • Figure 4: Detailed structure of the proposed VAT module.
  • Figure 5: Detailed structure of the proposed PGD module.
  • ...and 1 more figure