MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula

Sieun Hyeon, Kyudan Jung, Jaehee Won, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do

TL;DR

MathSpeech tackles accurate mathematical subtitling by coupling ASR with small language models that correct ASR errors and translate spoken math into LaTeX. The approach uses a two-stage pipeline, trained end to end, built from two fine-tuned T5-small models (an Error Corrector and a LaTeX Translator, 120M parameters in total) and a loss function that emphasizes LaTeX accuracy. On a new MathSpeech benchmark (1,101 lecture samples), the method outperforms GPT-4o and Gemini-Pro on LaTeX translation across CER, ROUGE, and BLEU, with low latency (~0.45 s for 5 s of speech). The work shows that lightweight models can produce high-quality mathematical subtitles at scale for lectures and online videos.
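The two-stage flow described above (correct the ASR text, then translate it to LaTeX) can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in: `toy_error_corrector`, `toy_latex_translator`, and `speech_text_to_latex` are illustrative names, and the string rules replace the fine-tuned models, which this summary does not specify in detail. The input is a spoken form of Euler's formula in which "sine" was misrecognized as "side".

```python
# Hypothetical stand-ins for the two stages; the real system uses
# fine-tuned sequence-to-sequence models, not string rules.

def toy_error_corrector(hypothesis: str) -> str:
    """Stage 1 (assumed behavior): repair one known ASR mishearing."""
    return hypothesis.replace(" side of ", " sine of ")


def toy_latex_translator(spoken: str) -> str:
    """Stage 2 (assumed behavior): map corrected spoken math to LaTeX.

    Only one phrase is covered; unknown input is returned unchanged.
    """
    table = {
        "e to the power of i x equals cosine of x plus i sine of x":
            r"e^{ix} = \cos(x) + i\sin(x)",
    }
    return table.get(spoken, spoken)


def speech_text_to_latex(asr_text: str) -> str:
    """Chain the two stages: ASR text -> corrected text -> LaTeX."""
    corrected = toy_error_corrector(asr_text)
    return toy_latex_translator(corrected)


if __name__ == "__main__":
    noisy = "e to the power of i x equals cosine of x plus i side of x"
    print(speech_text_to_latex(noisy))
```

The sketch only shows why the stages are chained: the translator's lookup fails on the uncorrected text, so the correction step has to run first.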

Abstract

In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise $\LaTeX{}$ format (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for $\LaTeX{}$ translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.
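For readers unfamiliar with the CER figures quoted above, the following Python sketch computes character error rate in its usual form: the Levenshtein edit distance between hypothesis and reference, divided by the reference length. This is the standard definition, not necessarily the exact evaluation script used for the reported numbers; any text normalization applied before scoring is not stated here, so the function names `levenshtein` and `cer` and the absence of preprocessing are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    reference = r"e^{ix} = \cos(x) + i\sin(x)"
    hypothesis = r"e^{ix} = \cos(x) + i side(x)"
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Lower is better: a CER of 0 means the generated LaTeX matches the reference character for character, and the value can exceed 1 when the hypothesis is much longer than the reference.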

Paper Structure

This paper contains 20 sections, 3 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Method for Collecting ASR Error Results.
  • Figure 2: This figure compares 2-beam search and our method. The left shows top-2 beam search by a single ASR model, while the right shows top-1 beam search by two ASR models.
  • Figure 3: Our pipeline that converts the lecturer's voice into LaTeX.
  • Figure 4: The method of training MathSpeech in an end-to-end manner.