Evaluating Compact LLMs for Zero-Shot Iberian Language Tasks on End-User Devices

Luís Couto Seller, Íñigo Sanz Torres, Adrián Vogel-Fernández, Carlos González Carballo, Pedro Miguel Sánchez Sánchez, Adrián Carruana Martín, Enrique de Miguel Ambite

TL;DR

The paper addresses the deployment of NLP on consumer devices for Iberian languages by evaluating a suite of compact LLMs on end-user hardware. It uses unified benchmarks (IberoBench and a multilingual IroSvA-based irony dataset) to assess reading comprehension, translation, math, QA, NLI, paraphrasing, and irony detection in Catalan, Spanish, Basque, Galician, and Portuguese. Key findings show that models such as Gemma-2-9B and Mistral-7B variants achieve strong multilingual performance, while Basque remains notably challenging. Instruction tuning generally improves comprehension and QA tasks, whereas distillation can trade language-specific strength for efficiency. The study offers practical insights into on-device NLP deployment and highlights the need for targeted pretraining and robust evaluation to close gaps in under-resourced languages. These results inform future development of compact multilingual models and deployment strategies for resource-constrained devices.

Abstract

Large Language Models have significantly advanced natural language processing, achieving remarkable performance in tasks such as language generation, translation, and reasoning. However, their substantial computational requirements restrict deployment to high-end systems, limiting accessibility on consumer-grade devices. This challenge is especially pronounced for under-resourced languages like those spoken in the Iberian Peninsula, where relatively limited linguistic resources and benchmarks hinder effective evaluation. This work presents a comprehensive evaluation of compact state-of-the-art LLMs across several essential NLP tasks tailored for Iberian languages. The results reveal that while some models consistently excel in certain tasks, significant performance gaps remain, particularly for languages such as Basque. These findings highlight the need for further research on balancing model compactness with robust multilingual performance.

Paper Structure

This paper contains 22 sections, 2 figures, and 9 tables.

Figures (2)

  • Figure 1: Salamandra-7B comparison
  • Figure 2: Latxa comparison