MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition

Felix M. Schmitt-Koopmann, Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, Alireza Darvishy

TL;DR

This work tackles the instability in printed mathematical expression recognition caused by non-canonical LaTeX ground-truths and limited font variation. It presents a data-centric normalization pipeline that canonicalizes LaTeX MEs, together with a multi-font dataset (im2latexv2) and a real-world test set (realFormula) to improve generalization. The authors introduce MathNet, a convolutional vision transformer-based MER model, and demonstrate state-of-the-art performance across four datasets, with substantial gains largely driven by normalization and broader font coverage. They also analyze limitations such as array usage and math font handling, and propose future directions to broaden font support and integrate with PDF-captioning workflows for better accessibility and searchability.

Abstract

Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. Because the same ME can be generated by different LaTeX source codes, the ground truth data contains unwanted variations that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome these problems and support it with experimental results. First, our main contribution is an enhanced LaTeX normalization that maps any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.
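To make the idea of a canonical form concrete, the following is a minimal Python sketch of a toy normalizer. It is not the authors' normalization pipeline: the function names (`tokenize`, `normalize`), the small synonym table, and the two rules shown (alias replacement and bracing of single-token sub/superscripts) are assumptions chosen for illustration only.

```python
import re

# Hypothetical synonym table (illustrative only): each key is a standard
# LaTeX alias that renders the same symbol as its value.
SYNONYMS = {
    r"\le": r"\leq",
    r"\ge": r"\geq",
    r"\ne": r"\neq",
    r"\to": r"\rightarrow",
}

# A token is a control word (\alpha), a control symbol (\,), or any single
# non-whitespace character.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|\\.|\S")


def tokenize(latex):
    """Split a LaTeX string into tokens, discarding whitespace."""
    return TOKEN_RE.findall(latex)


def normalize(latex):
    """Map a LaTeX string to a toy canonical token sequence.

    Rule 1: replace aliases by one preferred command.
    Rule 2: wrap a bare single-token sub/superscript argument in braces,
            so that x^2 and x^{2} yield the same tokens.
    Assumption: the argument after ^ or _ is a single token or a braced group.
    """
    tokens = [SYNONYMS.get(tok, tok) for tok in tokenize(latex)]
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        out.append(tok)
        if tok in ("^", "_") and i + 1 < len(tokens) and tokens[i + 1] != "{":
            out.extend(["{", tokens[i + 1], "}"])
            i += 2
            continue
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    # Two different source strings for the same rendered expression.
    print(normalize(r"x^2 \le y"))
    print(normalize(r"x^{2}\leq y"))
```

Both calls produce the same token string, so a model trained on normalized ground truth would see a single target for this expression instead of two competing ones.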

Paper Structure (30 sections, 2 equations, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: An example of an ME image that can be produced with more than one LaTeX code. While the two presented LaTeX codes are quite different ($22.2$% Edit score), they create the same image.
  • Figure 2: The ME $A = \mathcal{A} = \mathbb{A} = \boldsymbol{A}$ generated with the $16$ fonts, which can render all three basic mathematical fonts.
  • Figure 3: Overview of all 59 fonts in the im2latexv2 dataset.
  • Figure 4: Overview of our MER model, called MathNet. The CvT consists of $3$ layers, which are a combination of an embedding layer and a transformer block. The encoded image is decoded with a decoder transformer and a classifier layer.
  • Figure 5: The plot shows the average Edit score per sequence length for the different models on the im2latex-100k dataset. The x-axis shows the number of tokens in the ME with a bin width of 3. The y-axis shows the average Edit score of each bin. A perfect prediction has an Edit score of 1.
  • ...and 1 more figures
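
The Edit score used in Figures 1 and 5 can be illustrated with a short Python sketch. The definition below (one minus the token-level Levenshtein distance divided by the longer sequence length) is an assumption that agrees with the captions, where a perfect prediction scores 1, but the source shown here does not state the exact formula. The function names and the choice to bin by ground-truth length are likewise hypothetical.

```python
from collections import defaultdict


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_score(pred, truth):
    """Assumed definition: 1 - distance / max(len(pred), len(truth))."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pred, truth) / longest


def mean_score_per_bin(pairs, bin_width=3):
    """Average Edit score per length bin, keyed by the bin's lower bound.

    Assumption: the length of the ground-truth token sequence picks the bin.
    """
    bins = defaultdict(list)
    for pred, truth in pairs:
        bins[len(truth) // bin_width].append(edit_score(pred, truth))
    return {b * bin_width: sum(s) / len(s) for b, s in sorted(bins.items())}


if __name__ == "__main__":
    pairs = [
        (list("ab"), list("ab")),      # exact match, length 2
        (list("abcd"), list("abcx")),  # one substitution, length 4
    ]
    print(mean_score_per_bin(pairs))
```

With a bin width of 3, as in Figure 5, sequences of 0 to 2 tokens fall into the first bin, 3 to 5 tokens into the second, and so on.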