Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition

Hanbo Cheng, Chenyu Liu, Pengfei Hu, Zhenrong Zhang, Jiefeng Ma, Jun Du

TL;DR

This work tackles the challenge of leveraging bidirectional context in handwritten mathematical expression recognition (HMER) with tree decoders. It introduces the Mirror-Flipped Symbol Layout Tree (MF-SLT) to enable right-to-left supervision, Bidirectional Asynchronous Training (BAT) to fuse information from both decoding directions during training and inference, and Shared Language Modeling (SLM) to emphasize linguistic knowledge without adding inference parameters. The approach yields state-of-the-art results on CROHME 2014/2016/2019 and HME100K, outperforming prior bidirectional and language-informed methods while keeping the efficiency of greedy inference. The findings show that linguistic information matters more as the amount of training data grows, and that the proposed methods apply to both tree and string decoders, offering robust structure analysis and improved recognition under visual ambiguity.

Abstract

The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR. Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models. However, existing methods fail to effectively utilize bidirectional context information during the inference stage. Furthermore, current bidirectional training methods are primarily designed for string decoders and cannot adequately generalize to tree decoders, which offer superior generalization capabilities and structural analysis capacity. In order to overcome these limitations, we propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure. Our method extends the bidirectional training strategy to the tree decoder, allowing for more effective training by leveraging bidirectional information. Additionally, we analyze the impact of the visual and linguistic perception of the HMER model separately and introduce the Shared Language Modeling (SLM) mechanism. Through the SLM, we enhance the model's robustness and generalization when dealing with visual ambiguity, particularly in scenarios with abundant training data. Our approach has been validated through extensive experiments, demonstrating its ability to achieve new state-of-the-art results on the CROHME 2014, 2016, and 2019 datasets, as well as the HME100K dataset. The code used in our experiments will be publicly available.

Paper Structure (20 sections, 11 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: Three categories of visual confusion. (a): The "z" in the left and right instances is written in two different styles. (b): The "x" on the left resembles "n", and the "0 0" on the right resembles "$\infty$". (c): The background contains noise, which is easily misrecognized as "$\cdot$".
  • Figure 2: (a) Image and LaTeX sequence representation of the Mathematical Expression (ME). (b) The categories of the relationship between symbols. (c) The Symbol Layout Tree for the ME. (d) The tuple representation of the SLT.
  • Figure 3: The Bidirectional Asynchronous Training (BAT) strategy comprises a pipeline based on the encoder-decoder structure. The encoder stage generates the feature map $\boldsymbol{A}$ from the input image. The decoder stage includes a pair of R2L and L2R decoders. More specifically, the R2L decoder generates the MF-SLT, and the L2R decoder uses the hidden state produced by the R2L decoder to further predict the SLT.
  • Figure 4: In tree-structured labeling, the parent-child relationship is one-to-many, and there are multiple terminals. The character in blue represents the parent node, while the character in green represents the child.
  • Figure 5: The transformation of the original SLT into the MF-SLT, together with the L2R and R2L LaTeX sequences. The relations and characters on the "main path" are shown in bold and shaded.
  • ...and 9 more figures
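
The captions for Figures 2 and 4 describe a Symbol Layout Tree (SLT) that is flattened into tuples, with one-to-many parent-child relations and several terminal nodes. A minimal sketch of that idea follows. It assumes a (child, parent, relation) tuple layout, a depth-first traversal order, and illustrative relation names (`start`, `right`, `sup`) with a `<root>` placeholder; none of these details are given in the text above, so the paper's actual format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One symbol in a Symbol Layout Tree (hypothetical minimal structure)."""
    symbol: str
    # Each entry is (relation, child); a parent may have several children.
    children: List[Tuple[str, "Node"]] = field(default_factory=list)

    def add(self, relation: str, child: "Node") -> "Node":
        self.children.append((relation, child))
        return child


def to_tuples(node: Node, parent: str = "<root>", relation: str = "start"):
    """Flatten the tree depth-first into (child, parent, relation) tuples."""
    out = [(node.symbol, parent, relation)]
    for rel, child in node.children:
        out.extend(to_tuples(child, node.symbol, rel))
    return out


def terminals(node: Node) -> List[str]:
    """Collect symbols of nodes without children (the tree's terminals)."""
    if not node.children:
        return [node.symbol]
    found: List[str] = []
    for _, child in node.children:
        found.extend(terminals(child))
    return found


# Example expression: x^{2} + 1
root = Node("x")
root.add("sup", Node("2"))            # "2" is the superscript of "x"
plus = root.add("right", Node("+"))   # "+" follows "x" on the baseline
plus.add("right", Node("1"))          # "1" follows "+"

tuples = to_tuples(root)
leaves = terminals(root)
```

Here "x" is the parent of both "2" and "+", which is the one-to-many relationship the Figure 4 caption refers to, and the tree ends in two terminals ("2" and "1") rather than one, unlike a flat LaTeX string.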