Image-to-LaTeX Converter for Mathematical Formulas and Text
Daniil Gurgurov, Aleksey Morshnev
TL;DR
This paper addresses the translation of images of mathematical formulas and text into LaTeX code, using a TrOCR-inspired vision encoder-decoder that pairs a Swin Transformer encoder with a GPT-2 decoder. Training proceeds in two stages: a base model is trained on printed formulas, and a LoRA fine-tuning stage then adapts it to handwritten formulas, with about 3.1M trainable parameters in the fine-tuned model. The fine-tuned model reaches a Google BLEU of roughly 0.67, comparing favorably with Pix2Text and Sumen while trailing TexTeller; the comparison pays attention to how dataset composition affects its fairness. The authors release open-source models and end-to-end training code, including GPU-optimized workflows, to support reproducibility and further research on OCR for mathematical and scientific documents.
Abstract
In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
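To make the evaluation metric concrete, the following is a minimal sketch of sentence-level Google BLEU, which takes the minimum of n-gram precision and n-gram recall over n-gram orders 1 to 4. It is an illustration only, not the authors' evaluation code: the function names, the whitespace tokenization, and the example formulas are assumptions, and the tokenizer and aggregation used in the actual comparison may differ.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count all n-grams of order 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def google_bleu(reference, hypothesis, max_n=4):
    """Sentence-level Google BLEU: min(n-gram precision, n-gram recall)."""
    ref_counts = ngram_counts(reference, max_n)
    hyp_counts = ngram_counts(hypothesis, max_n)
    # Clipped overlap: an n-gram matches at most as often as it occurs in both.
    overlap = sum((ref_counts & hyp_counts).values())
    ref_total = sum(ref_counts.values())
    hyp_total = sum(hyp_counts.values())
    if ref_total == 0 or hyp_total == 0:
        return 0.0
    precision = overlap / hyp_total
    recall = overlap / ref_total
    return min(precision, recall)


# Hypothetical example: whitespace-tokenized LaTeX (tokenization is an assumption).
reference = r"\frac { a } { b } + c".split()
hypothesis = r"\frac { a } { b } - c".split()
score = google_bleu(reference, hypothesis)
print(f"Google BLEU: {score:.4f}")
```

Because the score is bounded by both precision and recall, a prediction that is much shorter or much longer than the reference LaTeX is penalized either way, which is useful when a model drops or hallucinates parts of a formula.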
