Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources
Belu Ticona, Fernando Carranza, Viviana Cotik
TL;DR
This work surveys the Indigenous languages spoken in Argentina, grouping them into seven families (Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon). It estimates speaker numbers from the national INDEC census and discusses limitations in the questionnaire design that likely lead to undercounting. It then inventories NLP and speech resources, noting that most come from outside Argentina and that Argentine varieties are underrepresented; per-family subsections profile available corpora, tasks, and notable projects (e.g., Mapudungún AVENUE corpus, Guaraní Jojajovai, Quechua QU resources, Chon corpus). The paper also provides methodological details in the appendix and highlights trends, challenges, and ethical considerations for community-centered development. Overall, the study establishes a baseline of linguistic diversity and resource availability, identifies gaps for local varieties, and offers direction for future community-engaged NLP work to support language revitalization in Argentina.
Abstract
Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each family, we present an estimate of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.
