Open Llama2 Model for the Lithuanian Language

Artūras Nakvosas, Povilas Daniušis, Vytas Mulevičius

TL;DR

This work addresses the scarcity of open, high-quality LLMs for Lithuanian by introducing two open Llama2-based models, LT-Llama2-7B and LT-Llama2-13B. The authors train the models in two phases, without using parameter-efficient fine-tuning (PEFT): pretraining on the Lithuanian component of CulturaX, then fine-tuning on Alpaca-based translations via LTQAV1. They release the models, a Lithuanian Q/A dataset, and translated benchmarks in an open repository. On ltqav1, the LT-Llama2 models reach substantially lower perplexities than Llama2 and Llama3, while language understanding benchmarks show mixed gains that depend on data quality. The work provides a reproducible resource for regional NLP research and highlights the ongoing need for richer, well-documented regional data to improve benchmark performance.

Abstract

In this paper, we propose and describe the first open Llama2 large language models (LLMs) for the Lithuanian language, including an accompanying question/answer (Q/A) dataset and translations of popular LLM benchmarks. We provide a brief review of open regional LLMs and detailed information on the proposed LLMs and their training process. We also conduct an empirical evaluation, comparing the perplexities of the proposed LLMs with those of other modern open LLMs. In addition, benchmarking the proposed LLMs against language understanding tasks reveals that high-quality pretraining datasets may be essential for achieving models that perform efficiently on these benchmarks. The full realisations of the described LLMs are available in the accompanying open repository at https://huggingface.co/neurotechnology.
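
Since the evaluation rests on comparing perplexities, a minimal sketch of the quantity may help. It assumes the standard definition (the exponential of the mean negative log-likelihood per token); the paper's exact formula and averaging scheme are not shown in this section, so the function names, the averaging convention, and the input values below are illustrative assumptions, not the authors' implementation.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one sequence from per-token natural-log probabilities.

    Assumes the standard definition: exp of the mean negative log-likelihood.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def average_perplexity(sequences):
    """Mean of per-sequence perplexities (one possible averaging convention)."""
    if not sequences:
        raise ValueError("need at least one sequence")
    return sum(perplexity(seq) for seq in sequences) / len(sequences)


# Hypothetical log-probabilities: a model that assigns probability 0.25
# to every token is as uncertain as a uniform choice among 4 options.
uniform_over_four = [math.log(0.25)] * 5
print(perplexity(uniform_over_four))  # approximately 4.0
```

In practice the per-token log-probabilities would come from the language model being evaluated; lower perplexity on held-out Lithuanian text means the model assigns higher probability to that text.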

Paper Structure

This paper contains 6 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Losses (y-axis) vs training steps (x-axis) during the model's pretraining.
  • Figure 2: Percentage of the Lithuanian component of the CulturaX dataset used in the pretraining (x-axis) vs. corresponding average perplexity (y-axis).
  • Figure 3: Accuracies (y-axis) of LMEH benchmarks for LT-Llama2-7B model, pretrained with different proportions of Lithuanian component of CulturaX dataset (x-axis). The MMLU benchmarks are summarized in mmlu_lt.
  • Figure 4: Accuracies (y-axis) of LMEH benchmarks for LT-Llama2-13B model, pretrained with different proportions of Lithuanian component of CulturaX dataset (x-axis). The MMLU benchmarks are summarized in mmlu_lt.
  • Figure 5: Source distribution of the Lithuanian component of the CulturaX dataset.
  • ...and 3 more figures