
TeXBLEU: Automatic Metric for Evaluate LaTeX Format

Kyudan Jung, Nam-Joon Kim, Hyongon Ryu, Sieun Hyeon, Seung-jun Lee, Hyeok-jae Lee

TL;DR

TeXBLEU, a metric for evaluating mathematical expressions in the LaTeX format built on the n-gram-based BLEU metric widely used in translation tasks, is proposed.

Abstract

LaTeX is suitable for creating specially formatted documents in science, technology, mathematics, and computer science. Although mathematical expressions in LaTeX format are increasingly used together with language models, there are no suitable metrics for evaluating them. In this study, we propose TeXBLEU, a metric for evaluating mathematical expressions in the LaTeX format built on the n-gram-based BLEU metric widely used in translation tasks. The proposed TeXBLEU consists of a predefined tokenizer trained on the arXiv paper dataset and a fine-tuned embedding model with positional encoding. The TeXBLEU score is calculated by replacing BLEU's modified precision score with the similarity of n-gram-based tokens. TeXBLEU showed improvements of 86%, 121%, and 610% over traditional evaluation metrics, such as BLEU, sacreBLEU, and ROUGE, respectively, on the MathBridge dataset with 1,000 data points. The code is available at https://github.com/KyuDan1/TeXBLEU.
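
The idea of replacing BLEU's exact-match modified precision with a similarity between n-grams of tokens can be illustrated with a toy sketch. This is not the authors' implementation: the regex tokenizer, the bag-of-characters `char_embedding`, and all function names are assumptions standing in for the arXiv-trained tokenizer and the fine-tuned embedding model, and positional encoding and any length penalty are left out.

```python
import math
import re
from collections import Counter


def tokenize_latex(expr):
    # Toy tokenizer (assumption): keep LaTeX commands such as \frac whole,
    # and split everything else into single non-space characters.
    return re.findall(r"\\[A-Za-z]+|\\.|\S", expr)


def char_embedding(token):
    # Stand-in for a learned embedding (assumption): bag-of-characters counts.
    return Counter(token)


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def ngrams(tokens, n):
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def ngram_similarity(a, b):
    # Mean token-wise similarity of two n-grams of equal length.
    sims = [cosine(char_embedding(x), char_embedding(y)) for x, y in zip(a, b)]
    return sum(sims) / len(sims)


def soft_precision(pred_tokens, ref_tokens, n):
    # For each predicted n-gram, take its best match in the reference,
    # then average. Exact-match BLEU would count only similarity == 1.
    pred, ref = ngrams(pred_tokens, n), ngrams(ref_tokens, n)
    if not pred or not ref:
        return 0.0
    return sum(max(ngram_similarity(p, r) for r in ref) for p in pred) / len(pred)


def soft_bleu(pred, ref, max_n=2):
    # Geometric mean of the soft precisions for n = 1..max_n.
    p_toks, r_toks = tokenize_latex(pred), tokenize_latex(ref)
    precisions = [soft_precision(p_toks, r_toks, n) for n in range(1, max_n + 1)]
    if min(precisions) <= 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


if __name__ == "__main__":
    print(soft_bleu(r"\frac{a}{b}", r"\frac{a}{b}"))   # identical: about 1.0
    print(soft_bleu(r"\dfrac{a}{b}", r"\frac{a}{b}"))  # near match: below 1.0
```

With this kind of soft matching, a prediction that uses `\dfrac` where the reference uses `\frac` is penalized only slightly, whereas an exact-match n-gram count would treat the two commands as entirely different tokens.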

Paper Structure (14 sections, 4 equations, 4 figures, 2 tables, 1 algorithm)

This paper contains 14 sections, 4 equations, 4 figures, 2 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Limitations of existing metrics in evaluating mathematical expressions in LaTeX format. Both BLEU and sacreBLEU fail to preserve meaning due to tokenizers that are incompatible with LaTeX, resulting in poorly tokenized expressions (highlighted in red). A lower CER generally indicates higher similarity. However, we observed high CER values even when identical LaTeX expressions were input. For WER, minor differences in spacing caused all words to be marked as non-matching, leading to a WER of 1.
  • Figure 2: An illustration of the key calculation process of TeXBLEU. First, spacing preprocessing is applied to both the predicted and reference sentences. Then, the sentences are tokenized using a tokenizer built on our LaTeX corpus. Positional encodings are obtained from GPT-2's wpe for each token, and token embeddings are extracted from a fine-tuned GPT-2 embedding model.
  • Figure 3: An illustration of the main experiment. The T5-Large model, fine-tuned on the MathBridge dataset, takes spoken mathematical expressions as input. Its predicted LaTeX output is then compared with the original ground truth in MathBridge using various metrics.
  • Figure 4: The results of tokenizing the quadratic formula in LaTeX format using various models. The green boxes indicate sections where LaTeX commands were successfully tokenized as complete chunks, while the red boxes represent sections where the LaTeX commands were fragmented, losing their meaning during tokenization. It is evident that the tokenizer we developed using the arXiv paper corpus performs the best in accurately tokenizing LaTeX commands.
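
The WER behavior described in the Figure 1 caption, and the motivation for the spacing preprocessing mentioned in the Figure 2 caption, can be reproduced with a small sketch. The `word_error_rate` function below is a standard word-level edit distance divided by the reference length; `normalize_spacing` is a hypothetical normalization chosen for illustration and is not taken from the paper.

```python
import re


def word_error_rate(reference, hypothesis):
    # Word-level Levenshtein distance divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def normalize_spacing(expr):
    # Hypothetical preprocessing (assumption): put exactly one space between
    # LaTeX commands and every other non-space character.
    return " ".join(re.findall(r"\\[A-Za-z]+|\\.|\S", expr))


if __name__ == "__main__":
    ref = "a^{2} + b^{2}"
    hyp = "a^{2}+b^{2}"  # same expression, different spacing

    print(word_error_rate(ref, hyp))  # 1.0
    print(word_error_rate(normalize_spacing(ref), normalize_spacing(hyp)))  # 0.0
```

Because WER splits on whitespace, the unspaced hypothesis becomes a single word that matches none of the three reference words, giving a WER of 1 for what is the same expression. Normalizing the spacing before comparison removes this artifact.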