PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer

Tongkun Guan, Chengyu Lin, Wei Shen, Xiaokang Yang

TL;DR

PosFormer addresses handwritten mathematical expression recognition by explicitly modeling symbol positions to capture complex hierarchies beyond LaTeX syntax. It introduces a position forest that encodes LaTeX substructures into trees and a joint training objective with a position-recognition task. An Implicit Attention Correction module further refines decoder attention to improve symbol-level feature learning. Across the CROHME, M2E, and MNE benchmarks, PosFormer achieves state-of-the-art results with no additional latency, and the approach generalizes to other architectures.

Abstract

Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms state-of-the-art methods by 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.
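
To make the idea of a per-symbol position identifier more concrete, the following is a minimal toy sketch and not the authors' implementation. It handles only superscript and subscript structures, and it assumes a hypothetical letter convention ("M" for the current baseline, "L" for the upper part after `^`, "R" for the lower part after `_`) that may differ from the scheme used in the paper. The function names `position_identifiers`, `nested_level`, and `relative_position` are illustrative and do not come from the PosFormer repository.

```python
from typing import List, Optional, Tuple

# Hypothetical convention for this sketch only (the paper's exact scheme may differ):
#   "M" = on the current baseline, "L" = upper part (after ^), "R" = lower part (after _)
BRANCH = {"^": "L", "_": "R"}


def position_identifiers(tokens: List[str]) -> List[Tuple[str, str]]:
    """Assign each non-structural LaTeX token a path-like position identifier."""
    result: List[Tuple[str, str]] = []
    stack: List[str] = ["M"]        # identifier of each currently open group
    pending: Optional[str] = None   # branch letter set by a preceding ^ or _
    for tok in tokens:
        if tok in BRANCH:
            pending = BRANCH[tok]
        elif tok == "{":
            if pending is None:
                raise ValueError("this sketch only handles braces after ^ or _")
            stack.append(stack[-1] + pending)
            pending = None
        elif tok == "}":
            if len(stack) == 1:
                raise ValueError("unbalanced closing brace")
            stack.pop()
        elif pending is not None:
            # Unbraced script such as `x ^ 2`: the branch applies to one symbol.
            result.append((tok, stack[-1] + pending))
            pending = None
        else:
            result.append((tok, stack[-1]))
    return result


def nested_level(identifier: str) -> int:
    """Number of script branches taken to reach the symbol."""
    return sum(ch in "LR" for ch in identifier)


def relative_position(identifier: str) -> str:
    """Last branch letter, i.e. the symbol's position within its parent."""
    return identifier[-1]


if __name__ == "__main__":
    for symbol, ident in position_identifiers("x ^ { a _ { b } } + y".split()):
        print(symbol, ident, nested_level(ident), relative_position(ident))
```

Because the identifiers are derived from the LaTeX token sequence alone, a labeling of this kind needs no annotation beyond the ground-truth LaTeX, which matches the abstract's statement that no extra annotations are required. Fractions, radicals, and special operators would need additional rules that this sketch does not cover.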

Paper Structure

This paper contains 21 sections, 11 equations, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Overall structure of our proposed PosFormer, which jointly optimizes two tasks: expression recognition and position recognition. The former employs parallel linear prediction for symbol recognition; the latter encodes the LaTeX sequence as a position forest structure and decodes the nested levels and relative positions of each symbol to assist in position-aware symbol-level feature representation learning.
  • Figure 1: Some visualization examples.
  • Figure 2: Four substructure types of mathematical expressions, including superscript-subscript, fraction, radical, and special operator structures.
  • Figure 3: Illustration of the position forest coding process, which can be simply described as sequence $\rightarrow$ substructure $\rightarrow$ tree $\rightarrow$ position forest. Specifically, we encode each symbol as a position identifier to denote its relative spatial position (e.g., "MLLR").
  • Figure 4: Illustration of structure symbols which are used to describe the position and hierarchical relationships between symbols in LaTeX.
  • ...and 2 more figures