Open Conversational LLMs do not know most Spanish words

Javier Conde, Miguel González, Nina Melero, Raquel Ferrando, Gonzalo Martínez, Elena Merino-Gómez, José Alberto Hernández, Pedro Reviriego

TL;DR

This paper investigates how well open-source conversational LLMs understand and use Spanish vocabulary, addressing a gap in multilingual evaluation. It employs a dictionary-based test using 100 words from the Diccionario del español actual (DEA), with prompts that elicit meanings and contextual usage, complemented by yes/no prompts and ChatGPT-based checks, across 12 models of varying sizes and language focus. Findings show substantial lexical deficits: many models fail to produce correct meanings or to use words correctly in context, larger model size brings only modest improvement, and Spanish-focused adaptations do not consistently enhance lexical knowledge. The study underscores the need for broader, scalable lexical testing and stronger multilingual lexical capabilities in open LLM ecosystems to promote linguistic fairness across languages.

Abstract

The growing interest in Large Language Models (LLMs), and in particular in conversational models with which users can interact, has led to the development of a large number of open-source chat LLMs. These models are evaluated on a wide range of benchmarks to assess their ability to answer questions or solve problems on almost any topic, or to test their ability to reason or interpret texts. In contrast, the knowledge that these models have of languages themselves, for example the words that they can recognize and use in different languages, has received much less attention. In this paper, we evaluate the knowledge that open-source chat LLMs have of Spanish words by testing a sample of words from a reference dictionary. The results show that open-source chat LLMs produce incorrect meanings for a substantial fraction of the words and are not able to use most of the words correctly to write sentences with context. These results show how Spanish is left behind in the open-source LLM race and highlight the need to push for linguistic fairness in conversational LLMs, ensuring that they provide similar performance across languages.

Paper Structure

This paper contains 14 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Number of words that fail on a given number of models.
  • Figure 2: Number of words failing on each model.
  • Figure 3: Number of models failing per word versus word frequency. Words that do not appear in CREA are plotted with a frequency of $10^{-3}$ because the frequency axis is logarithmic.
  • Figure 4: Models failing per word (red for failure, green for correct meaning).
  • Figure 5: Number of models failing per word (red for failure, green for correct meaning).
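
The figure captions describe several ways of aggregating the same word-by-model pass/fail outcomes. The sketch below is only an illustration of how we read those captions, not the authors' code: the `results` table, the word labels, and the model names are hypothetical placeholders rather than data from the paper. It shows how a per-word failure count, a histogram of words by number of failing models (compare Figure 1), and a per-model failure count (compare Figure 2) could be derived from one table.

```python
from collections import Counter

# Hypothetical pass/fail table: results[word][model] is True when the model
# produced a correct meaning. Placeholder values, not data from the paper.
results = {
    "word_a": {"model_1": True, "model_2": True, "model_3": False},
    "word_b": {"model_1": False, "model_2": False, "model_3": False},
    "word_c": {"model_1": True, "model_2": True, "model_3": True},
    "word_d": {"model_1": True, "model_2": False, "model_3": False},
}


def failures_per_word(results):
    """For each word, count the models that failed on it."""
    return {
        word: sum(not ok for ok in by_model.values())
        for word, by_model in results.items()
    }


def words_by_failure_count(results):
    """Histogram: how many words fail on exactly k models."""
    return Counter(failures_per_word(results).values())


def failures_per_model(results):
    """For each model, count the words it failed on."""
    counts = Counter()
    for by_model in results.values():
        for model, ok in by_model.items():
            counts[model] += not ok  # adds 1 on failure, 0 on success
    return counts


print(failures_per_word(results))
print(words_by_failure_count(results))
print(failures_per_model(results))
```

Keeping a single table and deriving each view from it means the per-word and per-model counts cannot drift apart, since both are computed from the same outcomes.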