Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language

Stefan Krsteski, Matea Tashkovska, Borjan Sazdov, Hristijan Gjoreski, Branislav Gerazov

TL;DR

The paper tackles the lack of robust Macedonian NLP support in LLMs by creating a large, curated Macedonian corpus ($3.5\times 10^{9}$ words) and an instruction-tuning dataset, plus a seven-benchmark evaluation suite, culminating in the $8\mathrm{B}$-parameter domestic-yak model. It demonstrates that continued monolingual pretraining on high-quality data followed by instruction tuning yields substantial gains: domestic-yak outperforms the existing models in the 8B parameter range on the benchmark suite and is competitive with multilingual systems up to $10\times$ larger. Native-speaker evaluations further confirm higher fluency, grammatical accuracy, and cultural relevance for domestic-yak-instruct relative to a much larger model. Collectively, the work provides a reproducible blueprint for expanding LLM capabilities in underrepresented languages through targeted data creation and monolingual adaptation, and it openly releases its resources to accelerate progress in Macedonian NLP and beyond.

Abstract

The increase in technological adoption worldwide comes with demands for novel tools to be used by the general population. Large Language Models (LLMs) provide a great opportunity in this respect, but their capabilities remain limited for low-resource languages, restricting applications in countries where such languages are spoken. We create several resources to facilitate the adoption of LLMs and to support research advancements for Macedonian. We collect the largest Macedonian corpus to date, consisting of 40GB of textual data and totaling 3.5B words. To support conversational applications, we collect a 106k-instance instruction dataset, carefully built to be culturally grounded. For evaluation, we construct a Macedonian evaluation suite covering seven benchmarks. Finally, we train domestic-yak, a state-of-the-art 8B-parameter model, on our curated datasets and evaluate it against eight baseline models using the newly constructed benchmark suite. Our model outperforms all existing models in the 8B parameter range across all benchmarks, and achieves performance comparable to models up to 10x larger. Furthermore, a qualitative analysis with native speakers reveals that our model is preferred over larger counterparts, receiving higher ratings for grammatical correctness and cultural appropriateness. All datasets, code, and model weights are openly released, setting a foundation for advancing LLMs in similarly underrepresented languages. These resources are publicly available at github.com/LVSTCK for source code, and at huggingface.co/LVSTCK for pretrained model weights and data.

Paper Structure

This paper contains 24 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Average fluency and relevance Likert ratings per model. domestic-yak-instruct outperforms Llama 3.1 70B Instruct in both dimensions (Wilcoxon signed-rank test, Bonferroni corrected, $p_{\text{fluency}} = 1.83\times 10^{-11}$, $p_{\text{relevance}} = 1.192\times 10^{-3}$). Statistical significance annotations: * if $p \in (10^{-2}, 0.05]$; ** if $p \in (10^{-3}, 10^{-2}]$; *** if $p \in (10^{-4}, 10^{-3}]$; and **** if $p \leq 10^{-4}$.
  • Figure 2: Token length distribution in the SFT dataset. The red dashed line indicates the 4,096-token cutoff, which covers 97.4% of all samples.
  • Figure 3: Distribution of topics in the instruction dataset. Question Answering tasks make up the majority (58.5%), followed by Chat Conversations (33.0%), with the Reasoning and Other categories accounting for 5.3% and 3.2%, respectively.
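
The significance annotation scheme in the Figure 1 caption can be read as a simple threshold mapping. The sketch below is illustrative only and is not taken from the paper's code: the function names `bonferroni` and `significance_stars`, and the `"ns"` label for non-significant results, are assumptions introduced here. It uses the textbook Bonferroni correction (multiply each raw p-value by the number of tests, capped at 1) and then maps a corrected p-value to the star notation described in the caption.

```python
def bonferroni(p_values):
    """Textbook Bonferroni correction: scale each raw p-value by the
    number of tests and cap the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significance_stars(p):
    """Map an (already corrected) p-value to the star annotation
    described in the Figure 1 caption."""
    if p <= 1e-4:
        return "****"
    if p <= 1e-3:
        return "***"
    if p <= 1e-2:
        return "**"
    if p <= 0.05:
        return "*"
    return "ns"  # assumption: the caption defines no label for p > 0.05


# The caption reports these values as already Bonferroni corrected,
# so they are passed to the star mapping directly.
reported = {"fluency": 1.83e-11, "relevance": 1.192e-3}
annotations = {name: significance_stars(p) for name, p in reported.items()}
```

Under this reading, the fluency comparison falls in the strongest band, while the relevance p-value sits just above $10^{-3}$ and therefore lands in the two-star band.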