Question Translation Training for Better Multilingual Reasoning

Wenhao Zhu; Shujian Huang; Fei Yuan; Shuaijie She; Jiajun Chen; Alexandra Birch

TL;DR

This paper addresses the gap between English and non-English reasoning in large language models by introducing Question Alignment (QAlign), a two-stage training framework: Stage I trains the model to translate non-English questions into English, and Stage II fine-tunes it on English instruction data. Because only questions are translated, not chain-of-thought responses, the method reuses the model's English expertise to improve non-English reasoning on the MGSM and MSVAMP benchmarks, and it outperforms translate-training baselines. On LLaMA2-13B, the average accuracy gains over translate-training across ten languages are 11.3% on MGSM and 16.1% on MSVAMP. Cross-language prediction consistency also improves, and adding multilingual supervision data raises performance further. The work points to a practical route to multilingual reasoning that avoids translating large instruction corpora, and it leaves larger models and other domains for future work.

Abstract

Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English. This is unsurprising given that their training data largely consists of English text and instructions. A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training. This approach not only incurs high cost, but also results in poorly translated data due to the non-standard formatting of mathematical chain-of-thought. In this paper, we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data. In this way we perform targeted, in-domain language alignment which makes best use of English instruction data to unlock the LLMs' multilingual reasoning abilities. Experimental results on LLaMA2-13B show that question alignment leads to consistent improvements over the translate-training approach: an average improvement of 11.3% and 16.1% accuracy across ten languages on the MGSM and MSVAMP multilingual reasoning benchmarks. The project will be available at: https://github.com/NJUNLP/QAlign.
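The two training stages described above can be sketched as a data-preparation step. This is a minimal illustration, not the authors' implementation: the `Example` class, the function names, and the translation prompt wording are all hypothetical, and the actual templates used in the QAlign repository may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    """One supervised fine-tuning pair (hypothetical container)."""
    prompt: str
    target: str


# Assumed prompt wording; the paper's real template is not given in this text.
TRANSLATE_PROMPT = "Translate the following question into English.\n{question}"


def stage1_examples(parallel_questions: List[Tuple[str, str]]) -> List[Example]:
    """Stage I (question alignment): non-English question -> English question."""
    return [
        Example(prompt=TRANSLATE_PROMPT.format(question=q_x), target=q_en)
        for q_x, q_en in parallel_questions
    ]


def stage2_examples(english_instructions: List[Tuple[str, str]]) -> List[Example]:
    """Stage II (response alignment): English question -> English response."""
    return [Example(prompt=q, target=r) for q, r in english_instructions]


def training_schedule(parallel_questions, english_instructions):
    """Return the two stages in the order they would be trained."""
    return [
        ("stage1_question_alignment", stage1_examples(parallel_questions)),
        ("stage2_response_alignment", stage2_examples(english_instructions)),
    ]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    parallel = [("¿Cuánto es 2 más 3?", "What is 2 plus 3?")]
    instructions = [("What is 2 plus 3?", "2 + 3 = 5. The answer is 5.")]
    for name, examples in training_schedule(parallel, instructions):
        print(name, len(examples))
```

Note that in this sketch no chain-of-thought response is ever translated: Stage I sees only question pairs, and Stage II sees only English data, which is the property the abstract contrasts with translate-training.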

Paper Structure (34 sections, 2 equations, 3 figures, 11 tables)

Introduction
Related Work
Large language model
Multilingual mathematical reasoning
Methodology
Stage I: Question Alignment
Stage II: Response Alignment
Monolingual supervision setting
Mixed supervision setting
Experiment Setting
Base LLM
Training Dataset
Training Details
Baseline Systems
Evaluation Dataset
...and 19 more sections

Figures (3)

Figure 1: Illustration of our two-stage training framework. In training stage I (question alignment), we use a set of multilingual questions for translation training. In training stage II (response alignment), we use cutting-edge English-only instruction data for fine-tuning. Because of the language alignment established in stage I, we can use the LLM's expertise in English to enhance its performance on non-English tasks.
Figure 2: Effects of tuning the language-aligned LLM with mixed supervised data. Generally, incorporating multilingual supervised data into our framework achieves a higher ceiling for average multilingual performance.
Figure 3: Comparison of the prediction consistency of different systems. Darker blue denotes a higher level of prediction consistency. The question alignment stage always improves the consistency of predicted answers.
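The exact definition of prediction consistency is not given in this text. As a sketch under that caveat, one plausible reading is the fraction of questions on which two languages yield the same predicted answer, computed for every language pair. The function names and the toy predictions below are hypothetical.

```python
from typing import Dict, List, Tuple


def prediction_consistency(preds_a: List[str], preds_b: List[str]) -> float:
    """Fraction of questions with identical predicted answers (assumed metric)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must cover the same questions")
    if not preds_a:
        return 0.0
    agree = sum(a == b for a, b in zip(preds_a, preds_b))
    return agree / len(preds_a)


def consistency_matrix(preds_by_lang: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Pairwise consistency for all language pairs, keyed by (lang_x, lang_y)."""
    langs = sorted(preds_by_lang)
    return {
        (x, y): prediction_consistency(preds_by_lang[x], preds_by_lang[y])
        for x in langs
        for y in langs
    }


if __name__ == "__main__":
    # Invented predictions for three parallel questions, for illustration only.
    preds = {"en": ["18", "3", "70"], "zh": ["18", "3", "64"]}
    for pair, score in consistency_matrix(preds).items():
        print(pair, round(score, 3))
```

Each entry of such a matrix would correspond to one cell of a heatmap like the one the caption describes, with higher values drawn in darker blue.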
