Open foundation models for Azerbaijani language

Jafar Isbarov, Kavsar Huseynova, Elvin Mammadov, Mammad Hajili, Duygu Ataman

TL;DR

This work addresses the scarcity of open, benchmarked Azerbaijani foundation models by introducing a large monolingual corpus (DOLLMA) and a corresponding family of encoder-only models (aLLMA) trained from scratch. It builds a practical evaluation suite including three new NLU tasks (AZE-SCI, AZE-NSP, CB-MCQ) and leverages existing datasets (WikiANN, SQuAD, LDQuAd, MRPC) to comprehensively compare open-source models. The experiments show that aLLMA-Base achieves strong performance for a monolingual Azerbaijani model and that both monolingual and multilingual approaches have merit in low-resource settings, while highlighting challenges in dataset quality and pretraining scope. The paper lays groundwork for larger-scale corpora, expanded benchmarks, and future exploration of generative Azerbaijani foundation models, with potential real-world impact on Azerbaijani NLP tools and applications.

Abstract

The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most production-grade systems rely on cloud solutions such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systematic benchmarking. This paper brings together several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) an extensive evaluation that covers all major open-source models with Azerbaijani support.

Paper Structure

This paper contains 17 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Training loss for aLLMA-Small, aLLMA-Base, and aLLMA-Large models.
  • Figure 2: Performance comparison among BERT models of the same configuration. aLLMA-Base outperforms the other models in 4 out of 6 benchmarks.