MaLLaM -- Malaysia Large Language Model
Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
TL;DR
MaLLaM presents a Malay-language large language model pre-trained from scratch on a 90-billion-token Malaysian corpus, with 1.1B, 3B, and 5B parameter variants trained on 349GB of data using a custom 32k BPE tokenizer. The work details data collection (public Malay sources, coding data, and synthetic instruction data), deduplication, and a scalable Azure-Ray training pipeline, including stability mitigations during training. Evaluation on the Tatabahasa benchmark shows competitive performance, and fine-tuning experiments demonstrate capabilities in multiturn Malay context QA, coding QA, and Malay instruction tasks. The authors release open-source models and resources to accelerate Malaysian NLP research and applications, highlighting the practical impact of culturally authentic language modeling for Malaysia.
Abstract
To address the lack of Large Language Models pretrained from scratch with Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion parameters for a single epoch on a substantial 349GB dataset, equivalent to 90 billion tokens under our pretrained Byte Pair Encoding (BPE) tokenizer. Although trained on a comparatively small dataset of 90 billion tokens, our instruction-tuned MaLLaM models perform competitively. When compared to ChatGPT3.5 and Malaysian Mistral, MaLLaM's instruction-tuned models demonstrate notable proficiency, underscoring the effectiveness of our approach in capturing the nuances of the Malaysian language. MaLLaM models provide comprehensive language representations grounded in Malaysian context, and aim to pave the way for improved natural language understanding and generation specific to the linguistic nuances present in Malaysia. We discuss the training methodology, dataset composition, and the potential impact of MaLLaM in advancing the capabilities of large language models for the Malay language. All models are released at https://huggingface.co/collections/mesolitica/mallam-6577b59d1e0b436ae75f930f
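
As a rough sanity check on the dataset figures quoted in the abstract, the short sketch below computes the average number of corpus bytes per BPE token. It assumes that "GB" means 10^9 bytes and that both figures are rounded; the function name is illustrative only and is not taken from the authors' pipeline.

```python
# Back-of-the-envelope check: average corpus bytes per BPE token.
# Both figures come from the abstract; "GB" is ASSUMED to mean 10**9 bytes.
DATASET_BYTES = 349 * 10**9   # 349GB dataset (assumed decimal gigabytes)
TOTAL_TOKENS = 90 * 10**9     # 90 billion tokens


def bytes_per_token(dataset_bytes: int, total_tokens: int) -> float:
    """Return the average number of raw bytes covered by one token."""
    return dataset_bytes / total_tokens


ratio = bytes_per_token(DATASET_BYTES, TOTAL_TOKENS)
print(f"{ratio:.2f} bytes per token")
```

Under these assumptions the ratio comes out just under 4 bytes per token. If the 349GB figure were instead measured in binary gigabytes (2^30 bytes), the ratio would be somewhat higher, so the result should be read as an estimate only.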
