Xmodel-1.5: An 1B-scale Multilingual LLM
Wang Qun, Liu Yang, Lin Qingquan, Jiang Ling
TL;DR
Xmodel-1.5 is a 1B-parameter multilingual LLM trained on 2T tokens with a 65,280-token unigram tokenizer. It combines RoPE, SwiGLU, and grouped-query attention to balance efficiency and multilingual coverage. Pretraining uses a diverse corpus (MultiWiki, CulturaX) with targeted data for low-resource languages and Thai, followed by instruction tuning with RAG and RAFT data to improve instruction following and e-commerce capabilities. Empirical results show competitive performance against 1B baselines, with strong Thai results and improvements over PolyLM-1.7B on several benchmarks. A Thai-specific dataset (Xdata_Thai) is released to support low-resource language research. The work also includes a collaboration with Chulalongkorn University to collect human evaluations, and its analysis of Thai linguistic challenges informs future directions in handling gendered language, time expressions, and culturally nuanced idioms. The resources are released openly for the community.
Abstract
We introduce Xmodel-1.5, a 1-billion-parameter multilingual large language model pretrained on 2 trillion tokens, designed for balanced performance and scalability. Unlike most large models, which use a BPE tokenizer, Xmodel-1.5 employs a custom unigram tokenizer with 65,280 tokens, optimizing both efficiency and accuracy. The model delivers competitive results across multiple languages, including Thai, Arabic, French, Chinese, and English, outperforming Alibaba's PolyLM-1.7B on the respective evaluation datasets. Xmodel-1.5 performs well on benchmarks such as mMMLU and PIQA, and achieves state-of-the-art results in Thai. To support low-resource language research, we release Xdata_Thai, a Thai-specific evaluation dataset featuring unique linguistic challenges such as gendered particles and idioms. While the model demonstrates strong performance, there is still room for improvement in handling culturally specific nuances. We hope this work contributes to advancements in multilingual AI research. Models and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelLM-1.5
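
To illustrate how a unigram tokenizer differs from BPE at inference time, the sketch below segments a string by picking the split with the highest total log-probability (a Viterbi search) rather than replaying merge rules. It is a minimal illustration under stated assumptions, not the Xmodel-1.5 tokenizer: the function name `unigram_segment`, the toy vocabulary, its probabilities, and the unknown-character penalty are all hypothetical, and the real 65,280-token vocabulary is learned from data.

```python
import math

def unigram_segment(text, log_probs, unk_log_prob=-20.0):
    """Return the most probable segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary piece to its log-probability.
    Characters not covered by any piece fall back to a single-character
    token with the (assumed) penalty `unk_log_prob`.
    """
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
            elif end - start == 1:
                score = best[start] + unk_log_prob  # unknown single character
            else:
                continue
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end of the string to recover the pieces.
    pieces = []
    i = n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Toy vocabulary with made-up probabilities (illustration only).
TOY_VOCAB = {
    piece: math.log(p)
    for piece, p in {
        "multi": 0.05,
        "lingual": 0.04,
        "ling": 0.02,
        "ual": 0.01,
        "multilingual": 0.001,
    }.items()
}

print(unigram_segment("multilingual", TOY_VOCAB))  # ['multi', 'lingual']
```

In this toy setup the two-piece split wins because 0.05 × 0.04 = 0.002 exceeds the 0.001 assigned to the whole word. A BPE tokenizer would instead apply its learned merges in a fixed order, without comparing alternative segmentations.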
