PP-FormulaNet: Bridging Accuracy and Efficiency in Advanced Formula Recognition

Hongen Liu, Cheng Cui, Yuning Du, Yi Liu, Gang Pan

TL;DR

PP-FormulaNet targets accurate and efficient formula recognition in real-world documents. It introduces two model variants (PP-FormulaNet-L for high accuracy and PP-FormulaNet-S for high efficiency), an arXiv-based dataset of about 4 million formulas built with a dedicated mining system, and three techniques for balancing precision and throughput: weight interpolation, knowledge distillation, and multi-token parallel prediction. Empirical results show that PP-FormulaNet-L outperforms UniMERNet, a leading baseline, by around 6% in Avg-BLEU, while PP-FormulaNet-S delivers over 16x faster GPU inference at batch size 15. The work also presents a formula mining and normalization pipeline that handles complex and user-defined LaTeX commands, which broadens applicability across diverse document collections. Code and pretrained models are publicly available in PaddleOCR and PaddleX.

Abstract

Formula recognition is an important task in document intelligence. It involves converting mathematical expressions from document images into structured symbolic formats that computers can easily work with. LaTeX is the most common format used for this purpose. In this work, we present PP-FormulaNet, a state-of-the-art formula recognition model that excels in both accuracy and efficiency. To meet the diverse needs of applications, we have developed two specialized models: PP-FormulaNet-L, tailored for high-accuracy scenarios, and PP-FormulaNet-S, optimized for high-efficiency contexts. Our extensive evaluations reveal that PP-FormulaNet-L surpasses the accuracy of prominent models such as UniMERNet by 6%, while PP-FormulaNet-S runs over 16 times faster. These advancements facilitate integration of PP-FormulaNet into a broad spectrum of document processing environments that involve intricate mathematical formulas. Furthermore, we introduce a Formula Mining System, which is capable of extracting a vast amount of high-quality formula data. This system further enhances the robustness and applicability of our formula recognition model. Code and models are publicly available at PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) and PaddleX (https://github.com/PaddlePaddle/PaddleX).

Paper Structure

This paper contains 21 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Definition of Formula Recognition, and Comparison of the Avg-BLEU and FPS of Different Models. GPU inference time was measured on a 32 GB Tesla V100 with a batch size of 15.
  • Figure 2: The overall architecture of PP-FormulaNet. Initially, formula-image pairs are extracted from arXiv paper sources through a formula mining system, resulting in a dataset comprising 4 million formulas. PP-FormulaNet consists of two variants: (1) PP-FormulaNet-L: A vision encoder based on the Vary-VIT-B backbone from GOT-2.0, combined with a 512-dimensional MBart Decoder. (2) PP-FormulaNet-S: A vision encoder based on the distilled PP-HGNetV2-B4, combined with a 384-dimensional MBart Decoder. To maintain the formula representation capability of pre-trained weights, weight interpolation is applied to adjust the Vary-VIT-B resolution from $1024\times1024$ to $768\times768$, and the decoder dimensions from 1024 to 512/384. Additionally, multi-token prediction is implemented in PP-FormulaNet-S to enhance inference speed by predicting multiple consecutive tokens in a single forward pass. $\widehat{s}$, $\widehat{e}$ and $\widehat{p}$ represent the start token, end token, and padding token, respectively.
  • Figure A: Visualization of recognition results from different methods on simple formulas. The left image is the test image, and the right image is the prediction result. A blank on the right indicates a syntax error in the generated formula that prevents it from rendering.
  • Figure B: Visualization of recognition results from different methods on middle formulas. The left image is the test image, and the right image is the prediction result. A blank on the right indicates a syntax error in the generated formula that prevents it from rendering.
  • Figure C: Visualization of recognition results from different methods on hard formulas. The left image is the test image, and the right image is the prediction result. A blank on the right indicates a syntax error in the generated formula that prevents it from rendering.
  • ...and 1 more figures
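
The Figure 2 caption states that PP-FormulaNet-S predicts multiple consecutive tokens in a single forward pass. The sketch below illustrates only the general idea of such a decoding loop and why it reduces the number of decoder calls. It is not the authors' implementation: the function names (`decode_multi_token`, `make_toy_model`), the token strings, and the toy "model" that simply replays a fixed LaTeX sequence are all hypothetical stand-ins.

```python
from typing import Callable, List, Tuple

# Placeholder strings for the start, end, and padding tokens
# (the caption denotes them s-hat, e-hat, p-hat); the spellings are assumptions.
START, END, PAD = "<s>", "</s>", "<pad>"


def decode_multi_token(
    predict_next: Callable[[List[str], int], List[str]],
    k: int,
    max_steps: int = 256,
) -> Tuple[List[str], int]:
    """Greedy decoding that consumes up to k tokens per forward pass.

    Returns the decoded tokens (without START/END) and the number of
    forward passes that were needed.
    """
    tokens = [START]
    steps = 0
    while steps < max_steps:
        chunk = predict_next(tokens, k)  # one forward pass yields k tokens
        steps += 1
        for tok in chunk:
            if tok == END:               # stop as soon as the end token appears
                return tokens[1:], steps
            if tok != PAD:               # padding carries no content
                tokens.append(tok)
    return tokens[1:], steps


def make_toy_model(target: List[str]) -> Callable[[List[str], int], List[str]]:
    """Hypothetical stand-in for a decoder: replays a fixed target sequence."""
    padded = list(target) + [END]

    def predict_next(prefix: List[str], k: int) -> List[str]:
        pos = len(prefix) - 1            # tokens already emitted after START
        chunk = padded[pos:pos + k]
        return chunk + [PAD] * (k - len(chunk))

    return predict_next


target = ["\\frac", "{", "a", "}", "{", "b", "}"]
model = make_toy_model(target)

seq1, steps1 = decode_multi_token(model, k=1)  # conventional one-token decoding
seq3, steps3 = decode_multi_token(model, k=3)  # three tokens per forward pass

print(steps1, steps3)
```

With this toy target of 7 tokens plus the end token, one-token decoding needs 8 forward passes while three-token decoding needs 3, and both return the same sequence. A real decoder would have to be trained to predict several future positions at once, and any accuracy cost of doing so is outside what this sketch shows.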