Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap

Hyunwoo Ko, Guijin Son, Dasol Choi

TL;DR

This work addresses the multilingual mathematical reasoning gap observed in large language models by focusing on Korean. It introduces HRM8K, a bilingual benchmark with 8,011 English-Korean math problems, and the UST method, which anchors reasoning in English and translates results back into Korean. Through training on ~130k synthetic samples, UST yields a 10.91% improvement on HRM8K and reduces the multilingual gap from 11.6% to 0.7%, with demonstrated generalization to other Korean domains. The authors publicly release the benchmark, training data, and models to enable broader evaluation and reuse.

Abstract

Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.
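
The three stages named in UST can be illustrated with a minimal prompt-chaining sketch. This is an approximation under stated assumptions, not the authors' implementation: the abstract describes a fine-tuned model, whereas the sketch chains three separate calls to a generic text generator. The function `ust_answer`, the `generate` callable, and the prompt wording are all hypothetical.

```python
from typing import Callable, Dict


def ust_answer(question_ko: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Chain three prompts: understand (English), solve (English), translate (Korean).

    `generate` is a hypothetical stand-in for any text-generation function
    that maps a prompt string to a completion string.
    """
    # Stage 1: restate the Korean problem in English.
    understanding = generate(
        "Restate the following Korean math problem in English, "
        "listing the given quantities and what is asked.\n\n" + question_ko
    )
    # Stage 2: solve in English, using the English restatement as the anchor.
    solution_en = generate(
        "Solve the problem step by step in English.\n\n" + understanding
    )
    # Stage 3: translate the English solution back into Korean.
    solution_ko = generate(
        "Translate the following solution into Korean, "
        "keeping all numbers unchanged.\n\n" + solution_en
    )
    return {
        "understand": understanding,
        "solve": solution_en,
        "translate": solution_ko,
    }
```

Passing `generate` as a parameter keeps the sketch independent of any particular model API; any function from prompt to completion can be plugged in.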

Paper Structure

This paper contains 43 sections, 4 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Example of UST process. When presented with a problem in Korean, the model generates Korean answers through the following processes highlighted in yellow: Understanding the Question, Solving the Question, and Translating the Solution into Korean.
  • Figure 2: Comparison of HRM8K performance (vertical axis) and three additional benchmarks (KMMLU, HAERAE-Bench, FLORES-200) for Qwen2.5 and Llama-3.1/2 across different model sizes. TE2E (blue) translates Korean input to English before solving; E2E (red) uses an English prompt from the start.
  • Figure 3: Screenshot of our Streamlit-based OCR validation tool, used to compare source documents with OCR outputs and correct any errors.
  • Figure 4: Reward model evaluation result on UST dataset. The samples were categorized into three groups based on the reward model score: high (RM Score $>$ 1, red), low (RM Score $<$ 0, blue), and medium (0 $\leq$ RM Score $\leq$ 1, green).
  • Figure 5: Qwen2.5-7B-Instruct performance across training epochs on the high (red) and low (blue) datasets. Dash-dotted lines show the results of the original Qwen2.5-7B-Instruct model under K2K and E2E prompting.
  • ...and 9 more figures
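
The grouping described in the Figure 4 caption can be written as a small function. This is a sketch that uses only the thresholds stated in the caption (high: score > 1, low: score < 0, medium: 0 ≤ score ≤ 1); the names `rm_bucket` and `bucket_counts` are hypothetical, and the scores in the usage note are made-up illustration values, not data from the paper.

```python
from collections import Counter
from typing import Iterable


def rm_bucket(score: float) -> str:
    """Map a reward-model score to the group named in the Figure 4 caption."""
    if score > 1:
        return "high"
    if score < 0:
        return "low"
    # Remaining case: 0 <= score <= 1.
    return "medium"


def bucket_counts(scores: Iterable[float]) -> Counter:
    """Count how many samples fall into each group."""
    return Counter(rm_bucket(s) for s in scores)
```

For example, `bucket_counts([1.5, 0.2, -0.3, 1.0])` places one score in "high", two in "medium" (0.2 and the boundary value 1.0), and one in "low".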