Indigenous Languages Spoken in Argentina: A Survey of NLP and Speech Resources

Belu Ticona, Fernando Carranza, Viviana Cotik

TL;DR

This work surveys the Indigenous languages spoken in Argentina, grouping them into seven families (Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon) and estimating speaker numbers from the national INDEC census while discussing design limitations that likely undercount speakers. It then inventories NLP and speech resources, noting that most resources come from outside Argentina and that Argentine varieties are underrepresented, with detailed subsections per family profiling available corpora, tasks, and notable projects (e.g., Mapudungún AVENUE corpus, Guaraní Jojajovai, Quechua QU resources, Chon corpus). The paper also provides methodological details in the appendix and highlights trends, challenges, and ethical considerations for community-centered development. Overall, the study establishes a baseline of linguistic diversity and resource availability, identifies gaps for local varieties, and offers direction for future, community-engaged NLP work to support language revitalization in Argentina.

Abstract

Argentina has a large yet little-known Indigenous linguistic diversity, encompassing at least 40 different languages. The majority of these languages are at risk of disappearing, resulting in a significant loss of world heritage and cultural knowledge. Currently, unified information on speakers and computational tools is lacking for these languages. In this work, we present a systematization of the Indigenous languages spoken in Argentina, classifying them into seven language families: Mapuche, Tupí-Guaraní, Guaycurú, Quechua, Mataco-Mataguaya, Aymara, and Chon. For each one, we present an estimation of the national Indigenous population size, based on the most recent Argentinian census. We discuss potential reasons why the census questionnaire design may underestimate the actual number of speakers. We also provide a concise survey of computational resources available for these languages, whether or not they were specifically developed for Argentinian varieties.

Paper Structure (10 sections, 1 figure, 2 tables)

This paper contains 10 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Geographical and demographic description of Indigenous languages spoken in Argentina by more than 1000 speakers [INDEC2024]. Left: the distribution of the Indigenous communities that share their name with an Indigenous language. Right: Indigenous languages spoken in Argentina, grouped by family and identified by their ISO 639-3 code (macrolanguage identifier, plus the microlanguage identifier when more than one variety is mentioned). For each language, the Indigenous population and the percentage of speakers are provided; the percentage is the proportion of the community that speaks the Indigenous language they consider to be the language of their community, as reported in the Argentine INDEC census. All data were obtained from the national census [INDEC2024]. Speaking population is the estimated number of speakers of the language, based on the Indigenous population and that percentage. Vitality shows the level of endangerment of the language according to Ethnologue [eberhard2015ethnologue], with symbols denoting stability, endangerment, and institutional status. When no data are available to distinguish between the varieties associated with a given language, the data are placed in the middle of the lines. Aymara is not shown on the map because the INDEC data contain no community data for Aymara.