Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer

Jayaprakash Sundararaj, Akhil Vyas, Benjamin Gonzalez-Maldonado

TL;DR

The paper tackles converting handwritten mathematical expressions into LaTeX code by framing the problem as an image-to-sequence task solved with encoder–decoder architectures. It systematically compares three models: a CNN encoder with an LSTM decoder, a fine-tuned pretrained ResNet50 encoder with an LSTM decoder, and a Vision Transformer with a transformer-based decoder. Results show that the Vision Transformer models achieve higher accuracy and BLEU-4 scores and lower Levenshtein distances than the CNN–LSTM and ResNet–LSTM baselines, highlighting the effectiveness of self-attention and patch-based representations for this multimodal task. The study also demonstrates the benefits of transfer learning and provides an open implementation to enable reproducibility and further research in automated mathematical transcription.

Abstract

Transforming mathematical expressions into LaTeX poses a significant challenge. In this paper, we examine the application of advanced transformer-based architectures to the task of converting handwritten or digital mathematical expression images into the corresponding LaTeX code. As a baseline, we use the current state-of-the-art CNN encoder and LSTM decoder. Additionally, we explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with a pretrained ResNet50 model, modified to suit grayscale input. Further, we experiment with a vision transformer model and compare it with the baseline and the CNN-LSTM model. Our findings reveal that the vision transformer architectures outperform the baseline CNN-RNN framework, delivering higher overall accuracy and BLEU scores while achieving lower Levenshtein distances. Moreover, these results highlight the potential for further improvement through fine-tuning of model parameters. To encourage open research, we also provide the model implementation, enabling reproduction of our results and facilitating further research in this domain.
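To make the Levenshtein metric mentioned above concrete, the following is a minimal sketch of edit distance between a reference and a predicted LaTeX sequence. It is not taken from the paper's implementation: the function name `levenshtein` and the choice to compare token lists rather than raw characters are assumptions made for illustration, and the abstract does not state at which granularity the authors compute the distance.

```python
def levenshtein(reference, prediction):
    """Edit distance between two sequences (strings or token lists).

    Each insertion, deletion, or substitution costs 1.
    Illustrative helper only, not the paper's code.
    """
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j items of the prediction.
    prev = list(range(len(prediction) + 1))
    for i, ref_item in enumerate(reference, start=1):
        curr = [i]
        for j, pred_item in enumerate(prediction, start=1):
            substitution = prev[j - 1] + (0 if ref_item == pred_item else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


# Hypothetical example: the model mistakes the denominator "b" for "c".
reference_tokens = ["\\frac", "{", "a", "}", "{", "b", "}"]
predicted_tokens = ["\\frac", "{", "a", "}", "{", "c", "}"]
distance = levenshtein(reference_tokens, predicted_tokens)
```

A lower distance means fewer edits are needed to turn the prediction into the reference, which is why lower values indicate a better model.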

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Formulas breakdown by length
  • Figure 2: Encoder architecture consisting of 3 convolution and max-pooling blocks, (50,200) -> (25,100) -> (12,5), whose output is flattened and fed into a dense layer (256 units)
  • Figure 3: Pretrained ResNet50 Encoder with LSTM Decoder.
  • Figure 4: Original latex image and the generated patches
  • Figure 5: Transformer encoder architecture
  • ...and 2 more figures
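
The patch generation referred to in Figure 4 can be sketched as follows. This is a hedged illustration of the general Vision Transformer idea of cutting an image into non-overlapping patches and flattening each one into a vector; it is not the paper's code. The function name `image_to_patches`, the 10x10 patch size, and the reuse of the (50,200) input size from the Figure 2 caption are all assumptions. Plain nested lists are used to keep the sketch dependency-free; a real pipeline would typically use an array library.

```python
def image_to_patches(image, patch_h, patch_w):
    """Split a 2D grayscale image (list of rows) into flattened patches.

    Patches are non-overlapping and returned in row-major order.
    Illustrative helper only; the patch size is chosen by the caller.
    """
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image dimensions must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Flatten one patch_h x patch_w block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_h)
                for col in range(left, left + patch_w)
            ]
            patches.append(patch)
    return patches


# Assumed sizes: a 50x200 image cut into 10x10 patches.
image = [[row * 200 + col for col in range(200)] for row in range(50)]
patches = image_to_patches(image, 10, 10)
```

Under these assumed sizes the image yields 5 x 20 = 100 patches of 100 pixel values each; in a Vision Transformer each such vector would then be linearly projected and combined with a position embedding before entering the encoder.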