Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting
Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokulakrishnan Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Preslav Nakov
TL;DR
Sherkala-Chat (8B) tackles the critical gap in Kazakh NLP by building an open-weight, Kazakh-centric LLM derived from LLaMA-3.1-8B and further enhanced through continual pretraining on 45.3B tokens across Kazakh, English, Russian, and Turkish. The work combines a targeted tokenizer expansion and embedding initialization with comprehensive instruction tuning and safety alignment to achieve state-of-the-art Kazakh performance while remaining competitive in English and Russian. Extensive evaluations across downstream tasks, generation benchmarks, and safety assessments demonstrate strong Kazakh knowledge, robust instruction-following, and improved regional safety responses, positioning Sherkala-Chat (8B) as a leading open-weight option for Kazakh AI applications. The authors also emphasize transparency by releasing training procedures, data sources, and safety methodologies to support responsible deployment and further research in Kazakh NLP.
Abstract
Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outperforming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. To ensure effective and responsible alignment, we leverage translated instruction datasets, a Kazakhstan-specific instruction dataset that is automatically constructed and manually verified, and Kazakh-specific safety data. We release Sherkala-Chat (8B) as an open-weight model, along with a detailed description of its training, alignment, and evaluation, to support research and real-world applications for Kazakh speakers.
