Curated Datasets and Neural Models for Machine Translation of Informal Registers between Mayan and Spanish Vernaculars

Andrés Lou, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

TL;DR

This work addresses the scarcity of accessible, informal-register parallel data for Mayan languages by creating MayanV, a publicly released collection of Mayan–Spanish corpora curated from native sources. It analyzes dialectal variation in Spanish within these corpora and evaluates a range of NMT setups, including bilingual, multilingual, and fine-tuned large models (NLLB-200), demonstrating that MayanV substantially improves translation quality over baselines. The results emphasize the importance of domain- and register-appropriate data for accurate MT in low-resource, underrepresented languages and highlight the potential of multilingual transfer, albeit constrained by data size. The work provides a foundation for more representative MT tools for Mayan languages and advocates for ongoing data collection and model adaptation to rural, everyday language use.

Abstract

The Mayan languages comprise a language family with an ancient history, millions of speakers, and immense cultural value that nevertheless remains severely underrepresented in terms of resources and global exposure. In this paper we develop, curate, and publicly release a set of corpora in several Mayan languages spoken in Guatemala and Southern Mexico, which we call MayanV. The datasets are parallel with Spanish, the dominant language of the region, and are taken from official native sources focused on representing informal, day-to-day, and non-domain-specific language. As such, and according to our dialectometric analysis, they differ in register from most other available resources. Additionally, we present neural machine translation models, trained on as many resources and Mayan languages as possible, and evaluated exclusively on our datasets. We observe lexical divergences between the dialects of Spanish in our resources and the more widespread written standard of Spanish. We also observe that resources other than the ones we present do not seem to improve translation performance, indicating that many such resources may not accurately capture common, real-life language usage. The MayanV dataset is available at https://github.com/transducens/mayanv.

Paper Structure (11 sections, 2 equations, 3 figures, 8 tables)

This paper contains 11 sections, 2 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Sample of the ancient Mayan script, reading b'alam, "jaguar", using a combination of the logogram and the syllabogram. Attribution: Goran tek-en under license CC BY-SA 4.0.
  • Figure 2: The Mayan linguistic communities of Guatemala
  • Figure 3: (a) Entry in the Q'eqchi' corpus of MayanV. The Q'eqchi' term is in bold font; the first set of italics is the Spanish translation, "loanword"; the regular text is a usage example of the term; the second set of italics is the Spanish translation of the example. (b) Extracted Q'eqchi' sentence and its Spanish translation: "There are many loanwords being introduced into our language."