Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding
Husein Zolkepli, Aisyah Razak, Kamarul Adha, Ariff Nazhan
TL;DR
This work develops a Malaysia-specialized extension of Mistral 7B by continuing pretraining on a Malay-context corpus. It releases pretrained variants with context lengths of 4096 and 32768 tokens, along with Malaysian Mistral, an instruction-tuned model with a 16384-token context length. The corpus integrates extensive public Malay data (including Wikipedia, government documents, and online content), which is deduplicated and postprocessed before training under two distinct context-length regimes. The authors then fine-tune the model on diverse synthetic instruction datasets and multi-turn QA and coding datasets to improve instruction-following and conversational abilities in Malay. Evaluation on the Tatabahasa (Malay grammar) benchmark and cross-model comparisons suggest competitive grammar performance and strong potential for local-language NLP tasks, and the open-source releases on HuggingFace enable broader adoption. The work also outlines a pathway toward future multimodal capabilities to further broaden AI accessibility for Malaysia.
Abstract
In this paper, we present the continued pretraining of Mistral 7B, a large-scale language model, on a dataset of 32.6 GB, equivalent to 1.1 billion tokens. We explore the impact of extending the context length by releasing models with context lengths of 4096 and 32768 tokens, and we further refine performance with a specialized instruction-tuned model with a 16384-token context length, which we call Malaysian Mistral. Our experiments demonstrate the efficacy of continued pretraining and the influence of extended context lengths on Mistral 7B's language understanding capabilities. The instruction-tuned 16384-context model additionally showcases the potential for capturing nuanced language intricacies. Furthermore, our research contributes to the benchmarking of Malaysian Mistral against prominent language models, including ChatGPT3.5 and Claude 2. We present compelling results indicating Malaysian Mistral's superior performance on the Tatabahasa (Malay grammar) test set, particularly when fine-tuned with instructions. All models are released at https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c
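
To make the notion of pretraining at a fixed context length concrete, the following is a minimal sketch of how tokenized documents can be packed into fixed-length training blocks. It is not taken from the authors' pipeline: the function name `pack_into_blocks`, the end-of-sequence id `eos_id=2`, and the toy token ids are assumptions for illustration. Only the context lengths 4096 and 32768 come from the text above.

```python
from typing import Iterable, List


def pack_into_blocks(
    docs: Iterable[List[int]],
    context_length: int,
    eos_id: int = 2,  # assumed end-of-sequence token id, for illustration only
) -> List[List[int]]:
    """Concatenate tokenized documents, separated by eos_id, and cut the
    stream into blocks of exactly `context_length` tokens.

    A trailing partial block is dropped rather than padded.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")

    buffer: List[int] = []
    blocks: List[List[int]] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # mark the document boundary
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= context_length:
            blocks.append(buffer[:context_length])
            buffer = buffer[context_length:]
    return blocks


# Toy example with a tiny context length so the result is easy to inspect.
toy_docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
toy_blocks = pack_into_blocks(toy_docs, context_length=4)

# The same hypothetical token stream packed under the two pretraining regimes.
stream = [[1] * 5000, [3] * 5000]
blocks_4k = pack_into_blocks(stream, context_length=4096)
blocks_32k = pack_into_blocks(stream, context_length=32768)
```

In this sketch, the same stream yields many short blocks at 4096 tokens and far fewer, longer blocks at 32768 tokens. A stream shorter than one block yields none, which is one reason long-context training depends on long documents or aggressive concatenation.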
