Table of Contents
Fetching ...

Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting

Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, Yuxia Wang, Zhuohan Xie, Rahul Pal, Daniil Orel, Parvez Mullah, Diana Turmakhan, Maiya Goloburda, Mohammed Kamran, Samujjwal Ghosh, Bokang Jia, Jonibek Mansurov, Mukhammed Togmanov, Debopriyo Banerjee, Nurkhan Laiyk, Akhmed Sakip, Xudong Han, Ekaterina Kochmar, Alham Fikri Aji, Aaryamonvikram Singh, Alok Anil Jadhav, Satheesh Katipomu, Samta Kamboj, Monojit Choudhury, Gurpreet Gosal, Gokulakrishnan Ramakrishnan, Biswajit Mishra, Sarath Chandran, Avraham Sheinin, Natalia Vassilieva, Neha Sengupta, Preslav Nakov

TL;DR

Sherkala-Chat (8B) tackles the critical gap in Kazakh NLP by building an open-weight, Kazakh-centric LLM derived from LLaMA-3.1-8B and further enhanced through continual pretraining on 45.3B tokens across Kazakh, English, Russian, and Turkish. The work combines a targeted tokenizer expansion and embedding initialization with comprehensive instruction tuning and safety alignment to achieve state-of-the-art Kazakh performance while remaining competitive in English and Russian. Extensive evaluations across downstream tasks, generation benchmarks, and safety assessments demonstrate strong Kazakh knowledge, robust instruction-following, and improved regional safety responses, positioning Sherkala-Chat (8B) as a leading open-weight option for Kazakh AI applications. The authors also emphasize transparency by releasing training procedures, data sources, and safety methodologies to support responsible deployment and further research in Kazakh NLP.

Abstract

Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outper-forming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. To ensure effective and responsible alignment, we leverage translated instruction datasets, a Kazakhstan-specific instruction dataset that is automatically constructed and manually verified, and Kazakh-specific safety data. We release Sherkala-Chat (8B) as an open-weight model, along with a detailed description of its training, alignment, and evaluation, to support research and real-world applications for Kazakh speakers.

Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting

TL;DR

Sherkala-Chat (8B) tackles the critical gap in Kazakh NLP by building an open-weight, Kazakh-centric LLM derived from LLaMA-3.1-8B and further enhanced through continual pretraining on 45.3B tokens across Kazakh, English, Russian, and Turkish. The work combines a targeted tokenizer expansion and embedding initialization with comprehensive instruction tuning and safety alignment to achieve state-of-the-art Kazakh performance while remaining competitive in English and Russian. Extensive evaluations across downstream tasks, generation benchmarks, and safety assessments demonstrate strong Kazakh knowledge, robust instruction-following, and improved regional safety responses, positioning Sherkala-Chat (8B) as a leading open-weight option for Kazakh AI applications. The authors also emphasize transparency by releasing training procedures, data sources, and safety methodologies to support responsible deployment and further research in Kazakh NLP.

Abstract

Llama-3.1-Sherkala-8B-Chat, or Sherkala-Chat (8B) for short, is a state-of-the-art instruction-tuned open generative large language model (LLM) designed for Kazakh. Sherkala-Chat (8B) aims to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from the LLaMA-3.1-8B model, Sherkala-Chat (8B) is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish. With 8 billion parameters, it demonstrates strong knowledge and reasoning abilities in Kazakh, significantly outper-forming existing open Kazakh and multilingual models of similar scale while achieving competitive performance in English. To ensure effective and responsible alignment, we leverage translated instruction datasets, a Kazakhstan-specific instruction dataset that is automatically constructed and manually verified, and Kazakh-specific safety data. We release Sherkala-Chat (8B) as an open-weight model, along with a detailed description of its training, alignment, and evaluation, to support research and real-world applications for Kazakh speakers.

Paper Structure

This paper contains 57 sections, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Kazakh MMLU Accuracy by Language Mixture
  • Figure 2: Model performance comparison across benchmarks in Kazakh, with scores evaluated using GPT-4o as the judge. Llama-3.1-Sherkala-8B-Chat (Sherkala-Chat (8B)) outperforms Qwen, Llama-3.1 and KazLLM.
  • Figure 3: Our Kazakh preprocessing pipeline.
  • Figure 4: Pairwise comparison for Kazakh, Russian and English text generation between Sherkala-Chat (8B) and KazLLM-1.0-8B across MT-Instructions-80 and Vicuna-Instructions-80.
  • Figure 5: Examples of how the raw data looks like after being transformed to follow the Llama-3.1 Chat template: the prompt is in green, and the response is in red. In the figure, (a) shows a multi-turn instruction in English, and (b) shows the same interaction in Kazakh.