Xmodel-1.5: An 1B-scale Multilingual LLM

Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling

TL;DR

Xmodel-1.5 is a 1B-parameter multilingual LLM trained on 2T tokens with a 65,280-token unigram tokenizer. Its architecture combines RoPE, SwiGLU, and grouped-query attention to balance efficiency and multilingual coverage. Pretraining draws on a diverse corpus (MultiWiki, CulturaX) with targeted data for low-resource languages and Thai, followed by instruction tuning with RAG and RAFT data to improve instruction following and e-commerce capabilities. Empirical results show competitive performance against 1B-scale baselines, with strong Thai results and improvements over PolyLM-1.7B on several benchmarks. A Thai-specific dataset (Xdata_Thai) is released to support low-resource language research. The work also includes a collaboration with Chulalongkorn University to collect human evaluations, and it analyzes Thai linguistic challenges such as gendered language, time expressions, and culturally nuanced idioms to inform future work.

Abstract

We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models that use the BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on respective evaluation datasets. Xmodel-1.5 excels in benchmarks like mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM-1.5
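To make the contrast with BPE concrete, the following is a minimal sketch of how a unigram tokenizer chooses a segmentation: each vocabulary piece has a log-probability, and Viterbi decoding picks the split with the highest total score. The function name `unigram_segment`, the toy vocabulary, its probabilities, and the unknown-character penalty are all illustrative assumptions; this is not the Xmodel-1.5 tokenizer or its 65,280-token vocabulary.

```python
import math


def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary piece to its log-probability.
    Single characters missing from the vocabulary fall back to
    `unk_log_prob` (an assumed penalty), so every string can be segmented.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    # best[i] = (score of the best segmentation of text[:i], start of its last piece)
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif end - start == 1:
                score = unk_log_prob  # unknown single character
            else:
                continue  # multi-character piece not in the vocabulary
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Follow the back-pointers from the end of the string to recover the pieces.
    pieces = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return pieces[::-1]


# Hypothetical toy vocabulary; the probabilities are made up for illustration.
TOY_LOG_PROBS = {
    piece: math.log(p)
    for piece, p in {
        "un": 0.10, "i": 0.05, "uni": 0.08, "gram": 0.10,
        "unigram": 0.02, "g": 0.01, "ram": 0.03,
    }.items()
}

if __name__ == "__main__":
    print(unigram_segment("unigram", TOY_LOG_PROBS))
    print(unigram_segment("unix", TOY_LOG_PROBS))
```

Unlike BPE, which applies a fixed sequence of learned merges, this formulation scores whole segmentations, so a single frequent piece can win over a chain of shorter ones.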

Paper Structure

This paper contains 22 sections, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Total multilingual data tokens during the pretraining phase sourced from MultiWiki and CulturaX.
  • Figure 2: Data distribution during pretraining between 44,000 and 190,000 steps.
  • Figure 3: Data distribution during the decay phase.
  • Figure 4: The trend of training and validation loss during pretraining.
  • Figure 5: Comparison of performance in multilingual tasks between PolyLM 1.7B and Xmodel-1.5 1B.
  • ...and 10 more figures