MIDB: Multilingual Instruction Data Booster for Enhancing Cultural Equality in Multilingual Instruction Synthesis

Yilun Liu, Chunguang Zhao, Xinhua Yang, Hongyong Zeng, Shimin Tao, Weibin Meng, Minggui He, Yan Yu, Hongxia Ma, Li Zhang, Daimeng Wei, Boxing Chen

TL;DR

MIDB introduces an automatic Multilingual Instruction Data Booster trained on 36.8k expert revisions across 16 languages to enhance multilingual instruction data quality and localization. By constructing the Multilingual Expert Boosted (MEB) dataset and a unified training objective, MIDB improves content accuracy, MT defect handling, and cultural localization, yielding better instruction-following and cultural understanding in multilingual LLMs. Extensive automatic and human evaluations across translated benchmarks (AlpacaEval-16L, MT-Bench-16L, BLEnD) and OOD data demonstrate consistent gains, including substantial improvements in cultural specificity. The work highlights potential for reducing English-centric bias and advancing linguistic and cultural equality in AI, while noting limitations in language coverage and scalability of human labor.

Abstract

Despite doubts about data quality, instruction synthesis has been widely applied to instruction tuning (IT) of LLMs as an economical and rapid alternative. Recent efforts have focused on improving the quality of synthesized instruction pairs in English and have facilitated IT of English-centric LLMs. However, data quality issues in multilingual synthesized instruction pairs are even more severe, since the common practice is to translate English synthesized data into other languages using machine translation (MT). Besides the known content errors in the English synthesized data, multilingual synthesized instruction data are further exposed to defects introduced by MT and suffer from insufficient localization to the target languages, leading to cultural inequality in trained LLMs. In this paper, we propose MIDB, a Multilingual Instruction Data Booster that automatically addresses the quality issues in multilingual synthesized data. MIDB is trained on around 36.8k revision examples across 16 languages produced by human linguistic experts, and can therefore boost low-quality data by addressing content errors and MT defects and by improving localization. Both automatic and human evaluations indicate that MIDB not only steadily improved instruction data quality in 16 languages, but also significantly enhanced the instruction-following and cultural-understanding abilities of multilingual LLMs fine-tuned on MIDB-boosted data, suggesting improved linguistic and cultural equality.

Paper Structure

This paper contains 43 sections, 1 equation, 10 figures, 5 tables.

Figures (10)

  • Figure 1: An example suggesting improved cultural equality in MIDB-boosted LLMs: the model successfully identified a popular second language in the cultural context of Greece. EL, RU, ES, etc., are language codes. See the code-name mapping of the 16 supported languages in Appendix \ref{language_dict}.
  • Figure 2: Illustration of the (a) training stage and (b) inference stage of MIDB.
  • Figure 3: Typical issues addressed in the MEB dataset.
  • Figure 4: Template of MIDB's training samples.
  • Figure 5: LLM-as-judge evaluation of LLMs trained on MIDB-boosted/pre-boosted Alpaca datasets and of 3 strong LLMs.
  • ...and 5 more figures