The #Somos600M Project: Generating NLP resources that represent the diversity of the languages from LATAM, the Caribbean, and Spain

María Grandury

TL;DR

The paper presents the #Somos600M Project, which directly addresses the scarcity of open instruction-tuning data for Spanish varieties and co-official languages by launching an international, open-source effort to create large-scale instruction datasets and an open generative LLM leaderboard. It combines a 2024 hackathon to generate synthetic instruction data for up to 7B-parameter LLMs, a dataset collection campaign focused on dialectal Spanish and co-official languages, and translation validation to enable multilingual evaluation. The results include 2.33 million instruction examples across diverse domains and an initial leaderboard comprising donated and translated evaluation datasets, validated by a broad community. The work demonstrates a scalable, community-driven approach to democratizing NLP resources for Spanish-speaking communities and lays the groundwork for expanding coverage to ethical and linguistic assessments and additional co-official languages.

Abstract

We are 600 million Spanish speakers. We launched the #Somos600M Project because the diversity of the languages from LATAM, the Caribbean, and Spain needs to be represented in Artificial Intelligence (AI) systems. Although we make up 7.5% of the world population, there is no open dataset to instruction-tune large language models (LLMs), nor a leaderboard to evaluate and compare them. In this paper, we present how, as an international open-source community, we have created the first versions of the instruction and evaluation datasets, which are indispensable resources for the advancement of Natural Language Processing (NLP) in our languages.

Paper Structure (32 sections, 10 figures, 10 tables)


Figures (10)

  • Figure 1: Instruction datasets generated during the #Somos600M Hackathon grouped by domain.
  • Figure 2: Tasks and languages (ES: Spanish, CA: Catalan and EU: Euskera) of the evaluation datasets of the first version of the open generative LLM leaderboard.
  • Figure 3: Cumulative number of monolingual English (orange) and Spanish (blue) datasets in the Hugging Face Hub over time until May 13, 2024.
  • Figure 4: Location of the #Somos600M Hackathon participants.
  • Figure 5: Word cloud of the occupations of the #Somos600M Hackathon participants, translated to English.
  • ...and 5 more figures