Towards Multilingual LLM Evaluation for Baltic and Nordic languages: A study on Lithuanian History

Yevhen Kostiuk, Oxana Vitman, Łukasz Gagała, Artur Kiulian

TL;DR

This paper evaluates how well multilingual LLMs answer Lithuanian history questions across language groups. It proposes a translation-based benchmarking pipeline that converts Lithuanian history questions from the EXAMS dataset into Nordic, Baltic, and other languages, with manual quality checks. The study benchmarks GPT-4o, open-source LLMs, and Nordic fine-tuned models. GPT-4o consistently leads, Baltic alignment remains challenging for smaller models, and the Nordic fine-tuned models do not surpass general multilingual models. The results shed light on cross-lingual knowledge transfer and language fairness, and suggest a need for targeted Baltic-language datasets and fine-tuning strategies to improve low-resource language performance.

Abstract

In this work, we evaluated the Lithuanian and general history knowledge of multilingual Large Language Models (LLMs) on a multiple-choice question-answering task. The models were tested on a dataset of Lithuanian national and general history questions translated into Baltic, Nordic, and other languages (English, Ukrainian, Arabic) to assess knowledge sharing across culturally and historically connected groups. We evaluated GPT-4o, LLaMa3.1 8b and 70b, QWEN2.5 7b and 72b, Mistral Nemo 12b, LLaMa3 8b, Mistral 7b, LLaMa3.2 3b, and Nordic fine-tuned models (GPT-SW3 and LLaMa3 8b). Our results show that GPT-4o consistently outperformed all other models across language groups, with slightly better results for Baltic and Nordic languages. Larger open-source models such as QWEN2.5 72b and LLaMa3.1 70b performed well but showed weaker alignment with Baltic languages. Smaller models (Mistral Nemo 12b, LLaMa3.2 3b, QWEN2.5 7b, LLaMa3.1 8b, and LLaMa3 8b) showed gaps in LT-related alignment with Baltic languages while performing better on Nordic and other languages. The Nordic fine-tuned models did not surpass multilingual models, indicating that shared cultural or historical context alone does not guarantee better performance.

Paper Structure (8 sections, 5 figures, 3 tables)

This paper contains 8 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Example of the dataset sample in Lithuanian.
  • Figure 2: Example of the chat prompts presented to the model for evaluation on the dataset, in Lithuanian and English. $<\dots>$ stands for the actual question on which the model is evaluated.
  • Figure 3: Accuracy results per language for LT-related history questions.
  • Figure 4: Accuracy results per language for general history questions.
  • Figure 5: Accuracy results per language for merged LT-related and general history questions.
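
The per-language accuracy breakdown named in Figures 3 to 5 (LT-related, general, and merged questions) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout, the function name `accuracy_per_language`, the subset labels, and the sample answers are all assumptions made for illustration, and the values are made up rather than taken from the paper.

```python
from collections import defaultdict

# Hypothetical records: (language, question_type, gold_answer, predicted_answer).
# These rows are invented for illustration and are not results from the paper.
records = [
    ("lt", "lt_related", "A", "A"),
    ("lt", "lt_related", "B", "C"),
    ("lt", "general", "D", "D"),
    ("en", "lt_related", "A", "A"),
    ("en", "general", "C", "C"),
    ("en", "general", "B", "A"),
]


def accuracy_per_language(records, question_type=None):
    """Return {language: accuracy}.

    question_type selects one subset ("lt_related" or "general");
    None merges both subsets, as in the merged view.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for language, q_type, gold, predicted in records:
        if question_type is not None and q_type != question_type:
            continue
        total[language] += 1
        # A multiple-choice answer counts as correct only on an exact match.
        correct[language] += int(gold == predicted)
    return {language: correct[language] / total[language] for language in total}


if __name__ == "__main__":
    for subset in ("lt_related", "general", None):
        print(subset or "merged", accuracy_per_language(records, subset))
```

Keeping the subset as a filter argument means the same function covers all three views, so the merged score is computed over the pooled questions rather than as an average of the two subset accuracies.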