A symbolic Perl algorithm for the unification of Nahuatl word spellings

Juan-José Guzmán-Landa, Jesús Vázquez-Osorio, Juan-Manuel Torres-Moreno, Ligia Quintana Torres, Miguel Figueroa-Saavedra, Martha-Lorena Avendaño-Garrido, Graham Ranger, Patricia Velázquez-Morales, Gerardo Eugenio Sierra Martínez

TL;DR

The paper tackles the NLP challenges of Nahuatl by addressing polyorthography with a symbolic unification strategy that encodes linguistic normalization rules as regular-expression patterns. Using the $π$-yalli corpus, it evaluates the unification’s impact on semantic similarity tasks at both word and sentence levels, comparing large language models and static embeddings. Results show that unigraphy generally improves performance, with static models approaching or surpassing some LLMs in certain settings and tasks, and the best gains observed with enhanced preprocessing. The work demonstrates the practical value of standardized orthography for low-resource languages and outlines future directions to extend rules and hybridize symbolic and neural approaches, with code available on GitHub.

Abstract

In this paper, we describe a symbolic model for the automatic orthographic unification of Nawatl text documents. Our model is based on algorithms that we have previously used to analyze sentences in Nawatl, and on the $π$-yalli corpus, which consists of texts in several Nawatl orthographies. Our automatic unification algorithm implements linguistic rules as symbolic regular expressions. We also present a manual evaluation protocol that we have proposed and implemented to assess the quality of the unified sentences generated by our algorithm, by testing them on a sentence-level semantic task. We have obtained encouraging results from the evaluators for most of the desired features of our artificially unified sentences.
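The idea of encoding normalization rules as regular expressions can be illustrated with a minimal sketch. It is written in Python for brevity, whereas the paper's implementation is in Perl, and it is not the authors' code. The rule list `RULES` and the function `unify` are hypothetical names, and the rewrite rules themselves (for example `hu` to `w`, `tz` to `ts`) are illustrative assumptions based on common Nawatl spelling variants, not the rule set used in the paper.

```python
import re

# Hypothetical rewrite rules, for illustration only (NOT the paper's rule set).
# Order matters: digraphs are rewritten before single letters.
RULES = [
    (re.compile(r"tz"), "ts"),            # tz -> ts
    (re.compile(r"qu(?=[ei])"), "k"),     # qu before e/i -> k
    (re.compile(r"hu(?=[aeio])"), "w"),   # hu before a vowel -> w
    (re.compile(r"uh(?![aeio])"), "w"),   # uh not followed by a vowel -> w
    (re.compile(r"c(?=[ei])"), "s"),      # c before e/i -> s
    (re.compile(r"c(?!h)"), "k"),         # any remaining c, except in ch -> k
    (re.compile(r"z"), "s"),              # z -> s
]


def unify(text: str) -> str:
    """Apply each rewrite rule in order to a lowercased copy of the text."""
    out = text.lower()  # assumption: case is ignored in this sketch
    for pattern, replacement in RULES:
        out = pattern.sub(replacement, out)
    return out


if __name__ == "__main__":
    for word in ["Nahuatl", "Nawatl", "quetzalli", "cihuatl"]:
        print(word, "->", unify(word))
```

Under these assumed rules, the two spellings "Nahuatl" and "Nawatl" map to the same form, which is the property a unification step needs before computing similarity between words or sentences. Because the rules are applied sequentially, their order is part of the design: a digraph rule such as `tz` must run before any single-letter rule that would consume one of its characters.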

Paper Structure

This paper contains 9 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Top: consonant phonemes in the Nawatl language. Bottom: vowel phonemes in the Nawatl language [monzon].
  • Figure 2: Phoneme–grapheme correspondences in Nawatl writing. Table adapted from [saavedra2024amapowalistli].
  • Figure 3: Relationship between the number of LSA components and average Kendall's $\tau$, for raw-text sentences (top) and sentences with unigraphy (bottom). Blue: embeddings addition; orange: linear addition of embeddings.
  • Figure 4: Relationship between the number of LSA components and average Kendall's $\tau$ after removing 4 stopwords ('iwan', 'in', 'tlen', 'ipan'), for embeddings addition and linear addition of embeddings, on raw-text sentences and on sentences with unigraphy.