Attention Guidance Mechanism for Handwritten Mathematical Expression Recognition

Yutian Liu, Wenjun Ke, Jianguo Wei

TL;DR

This paper tackles handwritten mathematical expression recognition (HMER) by introducing an attention guidance mechanism that addresses context leakage, a failure mode in which attention falls on image regions that should only be parsed at later decoding steps. It proposes two complementary strategies: self-guidance, which enforces consensus across attention heads, and neighbor-guidance, which reuses the previous decoding step's attention to guide the current step. Both are integrated into a DenseNet–Transformer HMER pipeline with ARM refinement. The approach reports state-of-the-art expression recognition rates (ExpRate) of 60.75%, 61.81%, and 63.30% on CROHME 2014, 2016, and 2019, respectively, and shows consistent gains in ablations, with improved alignment and reduced under-parsing. The findings suggest the mechanism may also apply to other attention-based sequence tasks that require dynamic alignment.
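For readers unfamiliar with the metric, ExpRate is the fraction of test expressions whose predicted symbol sequence matches the ground truth exactly. The snippet below is a minimal sketch of that computation; the function name `exp_rate` and the toy token sequences are illustrative assumptions, not taken from the paper or from the CROHME evaluation tools.

```python
def exp_rate(predictions, references):
    """Fraction of expressions whose predicted token sequence matches exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data: three expressions as LaTeX token lists.
preds = [
    ["x", "^", "{", "2", "}"],
    ["a", "+", "b"],                              # one wrong symbol
    ["\\frac", "{", "1", "}", "{", "2", "}"],
]
refs = [
    ["x", "^", "{", "2", "}"],
    ["a", "-", "b"],
    ["\\frac", "{", "1", "}", "{", "2", "}"],
]
score = exp_rate(preds, refs)  # two of three expressions match exactly
```

Because a single wrong symbol makes the whole expression count as incorrect, ExpRate is a strict measure, which is why alignment errors such as under-parsing weigh heavily on it.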

Abstract

Handwritten mathematical expression recognition (HMER) is challenging in image-to-text tasks due to the complex layouts of mathematical expressions and suffers from problems including over-parsing and under-parsing. To solve these, previous HMER methods improve the attention mechanism by utilizing historical alignment information. However, this approach has limitations in addressing under-parsing since it cannot correct the erroneous attention on image areas that should be parsed at subsequent decoding steps. This faulty attention causes the attention module to incorporate future context into the current decoding step, thereby confusing the alignment process. To address this issue, we propose an attention guidance mechanism to explicitly suppress attention weights in irrelevant areas and enhance the appropriate ones, thereby inhibiting access to information outside the intended context. Depending on the type of attention guidance, we devise two complementary approaches to refine attention weights: self-guidance that coordinates attention of multiple heads and neighbor-guidance that integrates attention from adjacent time steps. Experiments show that our method outperforms existing state-of-the-art methods, achieving expression recognition rates of 60.75% / 61.81% / 63.30% on the CROHME 2014 / 2016 / 2019 datasets.
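The idea of refining attention with a guidance map can be illustrated with a small sketch. The abstract does not give the exact formulas, so the code below makes two loudly labeled assumptions: the consensus over a set of attention maps is taken as their element-wise mean, and refinement adds the log of the guidance map to the raw scores before the softmax. All function names and numbers are hypothetical and are not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guidance_map(attention_maps):
    """Consensus of several attention maps (ASSUMED: element-wise mean)."""
    n = len(attention_maps)
    return [sum(column) / n for column in zip(*attention_maps)]


def refine(raw_scores, guide, eps=1e-6):
    """Refine raw correlations with a guidance map.

    ASSUMED rule: adding log(guide) to the raw scores multiplies the
    unnormalised attention by the guide before renormalising, so positions
    with low guidance are suppressed and the rest are enhanced.
    """
    gated = [s + math.log(g + eps) for s, g in zip(raw_scores, guide)]
    return softmax(gated)


# Toy setting: 4 image positions; position 3 belongs to a later symbol.
head_scores = [
    [2.0, 0.5, 0.1, 1.8],  # this head leaks attention to position 3
    [2.2, 0.4, 0.2, 0.1],
    [1.9, 0.6, 0.0, 0.2],
]
head_attn = [softmax(s) for s in head_scores]

# Self-guidance flavour: the guidance comes from the heads themselves.
guide = guidance_map(head_attn)
refined_head0 = refine(head_scores[0], guide)

# Neighbor-guidance flavour: the guidance comes from an adjacent time step.
prev_step_attn = [0.7, 0.2, 0.05, 0.05]  # hypothetical previous-step map
refined_neighbor = refine(head_scores[0], prev_step_attn)
```

In this toy case the leaking head's weight on position 3 drops after refinement in both flavours, while its weight on position 0 grows. How the paper actually builds the guidance map, and at which decoder layers each variant is applied, is not specified in this summary.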

Paper Structure

This paper contains 18 sections, 18 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Illustration of the context leakage phenomenon observed in a state-of-the-art coverage-based system CoMER. The rows are a series of attention maps attending to visible symbols. We skip attention maps of structural symbols ("{", "}", "^" and "_") for simplicity. The number in the upper left corner of each row denotes the $i$-th visible symbol in the generated sequence.
  • Figure 2: Alignment process of a Transformer decoder with $L$ layers. As we can see, the output layer attends to the image region of the symbol to be generated, while the middle layers focus on its previously decoded neighbor. Note that the three "$x$" are attended simultaneously in layer $L-2$ at decoding step 2, which causes the context leakage phenomenon.
  • Figure 3: Handwritten mathematical expression recognition with the proposed attention guidance mechanism. The model is based on CoMER, which consists of a CNN encoder and a bidirectionally trained Transformer decoder. The bidirectional decoder receives input and produces output in two directions simultaneously. Attention guidance is applied to the cross-attention modules of the decoder, where attention weights are refined via the guidance map.
  • Figure 4: Attention guidance mechanism. $T$ and $L$ denote the length of the query and the key, respectively. Attention guidance is a set of attention maps that can be derived from multiple sources. Raw correlations are refined based on the guidance map obtained by seeking consensus from the attention guidance.
  • Figure 5: Structures of the conventional attention and the proposed self-guidance module. $\odot$ denotes the element-wise multiplication and $\oplus$ denotes the element-wise addition.
  • ...and 3 more figures