Image-to-LaTeX Converter for Mathematical Formulas and Text

Daniil Gurgurov, Aleksey Morshnev

TL;DR

This paper addresses the translation of images of mathematical formulas and text into LaTeX code, using a TrOCR-inspired vision encoder–decoder that pairs a Swin Transformer encoder with a GPT-2 decoder. Training proceeds in two stages: a base model is trained on printed (machine-generated) formulas, and a LoRA-enhanced version is then fine-tuned on handwritten formulas, so that adaptation requires only about 3.1M trainable parameters in the fine-tuned model. On a handwritten test set the fine-tuned model reaches a Google BLEU of roughly 0.67, comparing favorably with Pix2Text and Sumen while trailing TexTeller; the authors take differences in dataset composition into account to keep the comparison fair. The work contributes open-source models and end-to-end training code, including distributed training and GPU optimizations, to support reproducibility and further research on OCR for mathematical and scientific documents.
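To see why LoRA keeps the trainable parameter count small, the following sketch counts the parameters one adapter adds to a single weight matrix. LoRA freezes the original matrix W and learns a low-rank update B·A, so an adapter of rank r on a d_out × d_in matrix adds r·(d_in + d_out) parameters. The hidden size and rank below are hypothetical illustration values, not the paper's configuration, and the sketch does not attempt to reproduce the 3.1M figure.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) weight matrix.

    LoRA freezes W and learns the update B @ A, where A has shape
    (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * d_in + d_out * rank


def dense_params(d_in: int, d_out: int) -> int:
    """Parameters in the full (frozen) weight matrix, ignoring any bias."""
    return d_in * d_out


# Hypothetical values chosen for illustration only (not from the paper).
HIDDEN = 768
RANK = 16

adapter = lora_trainable_params(HIDDEN, HIDDEN, RANK)
dense = dense_params(HIDDEN, HIDDEN)
print(f"adapter: {adapter}, dense: {dense}, ratio: {adapter / dense:.4f}")
```

Under these assumed values the adapter trains 24,576 parameters against 589,824 in the frozen matrix, which illustrates how adapting many layers can still leave the total in the low millions.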

Abstract

In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
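The BLEU comparison can be made concrete with a minimal sketch of sentence-level Google BLEU (GLEU): collect all n-grams of length 1 to 4 from the reference and the hypothesis, count the overlap, and take the minimum of n-gram precision and recall. The whitespace-separated LaTeX tokens and the function names here are assumptions for illustration; the authors' exact tokenization and evaluation code may differ.

```python
from collections import Counter


def ngram_counts(tokens, max_n=4):
    """Count every n-gram of length 1..max_n in a token list."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(reference, hypothesis, max_n=4):
    """Sentence-level GLEU: min(precision, recall) over pooled 1..max_n-grams."""
    ref = ngram_counts(reference, max_n)
    hyp = ngram_counts(hypothesis, max_n)
    overlap = sum((ref & hyp).values())  # clipped n-gram matches
    total_ref = sum(ref.values())
    total_hyp = sum(hyp.values())
    if total_ref == 0 or total_hyp == 0:
        return 0.0
    return min(overlap / total_hyp, overlap / total_ref)


# Assumed tokenization: LaTeX split on whitespace.
reference = r"\frac { a } { b }".split()
hypothesis = r"\frac { a } { c }".split()
print(sentence_gleu(reference, hypothesis))
```

Because the score is the minimum of precision and recall, both truncated and overlong predictions are penalized, which suits LaTeX output where missing or extra braces change the meaning.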

Paper Structure (19 sections, 3 figures, 1 table)

This paper contains 19 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Proposed architecture for training a base formula recognition model.
  • Figure 2: Base-Model Training Details.
  • Figure 3: LoRA-Model Training Details.