Tagengo: A Multilingual Chat Dataset
Peter Devine
TL;DR
This work tackles the gap in open-source multilingual LLM capabilities by creating Tagengo, a high-quality dataset of over 70k prompts across 74 languages, and using it to train the Suzume 8B multilingual and Suzume 8B Japanese LLMs. The study demonstrates that multilingual training yields strong non-English performance, and that cross-lingual transfer improves Japanese performance compared with training on Japanese data alone. On MT-Bench, the open-source Suzume models outperform comparable-size peers in several languages, though GPT-3.5-Turbo still leads in many cases, underscoring both the progress and the remaining gaps for open-source multilingual LLMs. The author releases the data, training code, and models, emphasizing the importance of diverse multilingual data for accessible and robust cross-language AI systems.
Abstract
Open-source large language models (LLMs) have improved greatly in recent times. However, many of these models focus solely on widely spoken languages. We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages, consisting of human-generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open-source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open-source LLMs in every language. We further find that training on more multilingual data benefits performance in a chosen target language (Japanese) compared with training only on data in that language. These results indicate the necessity of training on large amounts of high-quality multilingual data to make a more accessible LLM.
