Localizing AI: Evaluating Open-Weight Language Models for Languages of Baltic States
Jurgita Kapočiūtė-Dzikienė, Toms Bergmanis, Mārcis Pinnis
TL;DR
The paper investigates whether locally deployable open-weight LLMs can effectively support the Baltic languages (Lithuanian, Latvian, Estonian) in privacy-sensitive contexts. It evaluates several model families (Llama 3, Gemma 2, NeMo, Phi) on three tasks: zero-shot machine translation on FLORES-200, multiple-choice question answering (MCQA) on Belebele, and free-form text generation rated by humans for Lithuanian and Latvian. The models are run with 4-bit quantization on local hardware. Gemma 2 (notably the 27B variant) approaches top-tier commercial performance on translation and question answering, while many open-weight models still suffer from lexical hallucinations and limited coverage of lesser-spoken languages. Gemma 2 is robust to quantization, whereas Llama models degrade more. The work highlights the practical viability of sovereign AI for governmental and defense contexts, while emphasizing the need for higher-quality Baltic data and language-specialized tuning to reduce lexical errors and improve generation quality.
Abstract
Although large language models (LLMs) have transformed our expectations of modern language technologies, concerns over data privacy often restrict the use of commercially available LLMs hosted outside of EU jurisdictions. This limits their application in governmental, defence, and other data-sensitive sectors. In this work, we evaluate the extent to which locally deployable open-weight LLMs support lesser-spoken languages such as Lithuanian, Latvian, and Estonian. We examine various size and precision variants of the top-performing multilingual open-weight models, Llama 3, Gemma 2, Phi, and NeMo, on machine translation, multiple-choice question answering, and free-form text generation. The results indicate that while certain models like Gemma 2 perform close to the top commercially available models, many LLMs struggle with these languages. Most surprisingly, however, we find that these models, while showing close to state-of-the-art translation performance, are still prone to lexical hallucinations with errors in at least 1 in 20 words for all open-weight multilingual LLMs.
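To make the "1 in 20 words" figure concrete, the following is a minimal sketch of how a per-word lexical error rate could be computed from human annotations. The function name, the whitespace tokenization, and the annotation format (a set of flagged word positions) are assumptions for illustration only and are not taken from the paper's evaluation protocol.

```python
def lexical_error_rate(text: str, flagged_indices: set[int]) -> float:
    """Return the fraction of whitespace-separated words flagged as lexical errors.

    `flagged_indices` holds 0-based word positions that an annotator marked
    as erroneous (hypothetical annotation format).
    """
    words = text.split()
    if not words:
        return 0.0
    # Ignore any flagged positions that fall outside the text.
    valid = {i for i in flagged_indices if 0 <= i < len(words)}
    return len(valid) / len(words)


# Hypothetical 40-word model output with two words flagged by an annotator.
sample = " ".join(f"w{i}" for i in range(40))
rate = lexical_error_rate(sample, {3, 27})
print(f"{rate:.2%}")  # 2 of 40 words, i.e. 1 in 20
```

Under these assumptions, an error rate of 0.05 corresponds to the abstract's lower bound of one erroneous word in every twenty.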
