A First Context-Free Grammar Applied to Nawatl Corpora Augmentation

Juan-José Guzmán-Landa, Juan-Manuel Torres-Moreno, Miguel Figueroa-Saavedra, Ligia Quintana-Torres, Martha-Lorena Avendaño-Garrido, Graham Ranger

TL;DR

This work tackles the scarcity of computational resources for Nawatl by introducing a non-recursive context-free micro-grammar, μgnaw⊕0, designed to generate syntactically valid Nawatl sentences for corpus augmentation. By filtering the synthetic output and merging it with the authentic π-yalli corpus, the authors create π-yall-ia⊕0, an expanded resource used to train static embeddings and evaluate sentence-level semantic similarity. The augmented approach yields a Kendall’s τ of up to 0.527 for FastText against top LLMs, demonstrating the potential of synthetic data to bolster low-resource languages, while highlighting the need for richer grammars and improved semantic filters to realize larger gains. Future work plans to extend grammatical coverage (more persons, tenses, plurals), introduce recursive grammars, and broaden downstream tasks like sentiment, summarization, and NER.
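To make the reported metric concrete, here is a minimal sketch of how a Kendall's τ between two score lists can be computed. It assumes the simple tie-free variant (τ-a); the source does not say which variant or which library the authors used, and the function name `kendall_tau_a` and the example scores are purely illustrative.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumes both lists have the same length (at least 2). Tied pairs
    count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Illustrative scores only (not data from the paper): similarity
# rankings produced by two hypothetical systems for four sentence pairs.
system_a = [1, 2, 3, 4]
system_b = [1, 3, 2, 4]
tau = kendall_tau_a(system_a, system_b)
```

A value of 1 means the two systems rank every pair of items in the same order, and -1 means the rankings are fully reversed.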

Abstract

In this article we introduce a context-free grammar (CFG) for the Nawatl language. Nawatl (or Nahuatl) is an Amerindian language of the π-language type, i.e. a language with few digital resources, in which the corpora available for machine learning are virtually non-existent. The objective here is to generate a significant number of grammatically correct artificial sentences, in order to increase the corpora available for language model training. We want to show that a grammar enables us significantly to expand a corpus in Nawatl which we call π-yalli. The corpus, thus enriched, enables us to train algorithms such as FastText and to evaluate them on sentence-level semantic tasks. Preliminary results show that by using the grammar, comparative improvements are achieved over some LLMs. However, it is observed that to achieve more significant improvement, grammars that model the Nawatl language even more effectively are required.
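The idea of generating sentences from a non-recursive context-free grammar can be sketched as follows. This is a toy illustration, not the authors' grammar: the rule set, the symbol names, and the placeholder tokens (`DET_a`, `NOUN_a`, and so on) are all hypothetical and stand in for real Nawatl lexical items. Because no symbol can reach itself, the language is finite and can be enumerated exhaustively.

```python
from itertools import product

# Hypothetical toy grammar. Neither the rules nor the placeholder
# tokens come from the paper; they only illustrate the mechanism.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["noun"]],
    "VP": [["verb"], ["verb", "NP"]],
    "det": [["DET_a"]],
    "noun": [["NOUN_a"], ["NOUN_b"]],
    "verb": [["VERB_a"]],
}


def expand(symbol, grammar):
    """Return every token sequence derivable from `symbol`.

    Terminates only because the grammar is non-recursive: a symbol
    that could derive itself would cause infinite recursion here.
    """
    if symbol not in grammar:
        # Anything without a rule is treated as a terminal token.
        return [[symbol]]
    results = []
    for production in grammar[symbol]:
        # Expand each right-hand-side symbol, then combine the
        # alternatives with a Cartesian product.
        parts = [expand(s, grammar) for s in production]
        for combo in product(*parts):
            results.append([tok for part in combo for tok in part])
    return results


sentences = [" ".join(tokens) for tokens in expand("S", GRAMMAR)]
```

Even this tiny rule set yields 20 distinct token sequences, which suggests how a modest grammar with a realistic lexicon could multiply the size of a corpus. Generated output would still need filtering before being merged with authentic text.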

Paper Structure

This paper contains 13 sections, 1 equation, 1 figure, 6 tables.

Figures (1)

  • Figure 1: Outline of the Sentences Semantic Similarity task.