Grammatical Error Correction for Low-Resource Languages: The Case of Zarma

Mamadou K. Keita, Christopher Homan, Marcos Zampieri, Adwoa Bremang, Habibatou Abdoulaye Alfari, Elysabhete Amadou Ibrahim, Dennis Owusu

TL;DR

This study systematically compares rule-based, MT-based, and LLM-based GEC approaches for Zarma, a low-resource West African language, using a large dataset of synthetic and human-annotated examples (>250,000) and validating cross-language generalization with Bambara. The MT-based approach using M2M100 delivers the strongest performance in both automatic and manual evaluations, achieving high spelling correction efficacy and robust grammar improvements, while rule-based methods excel at spelling detection but struggle with context-level corrections. LLM-based methods show moderate results and benefit from targeted prompting and tuning, but fall short of MT-based performance in this setting. The work demonstrates the practicality of MT models for low-resource GEC, and its replication in Bambara suggests broader applicability, with potential applications in education, content creation, and cultural documentation. Future work proposes hybrid systems, data augmentation, and cross-lingual transfer to extend reach to more languages.

Abstract

Grammatical error correction (GEC) aims to improve the quality and readability of texts through accurate correction of linguistic mistakes. Previous work has focused on high-resource languages, while low-resource languages lack robust tools. Low-resource languages often face problems such as non-standard orthography, limited annotated corpora, and diverse dialects, which slow down the development of GEC tools. We present a study on GEC for Zarma, a language spoken by over five million people in West Africa. We compare three approaches: rule-based methods, machine translation (MT) models, and large language models (LLMs). We evaluated them using a dataset of more than 250,000 examples, including synthetic and human-annotated data. Our results showed that the MT-based approach using M2M100 outperforms the others, with a detection rate of 95.82% and a suggestion accuracy of 78.90% in automatic evaluations (AE) and an average score of 3.0 out of 5.0 in manual evaluation (ME) from native speakers for grammar and logical corrections. The rule-based method was effective for spelling errors but failed on complex context-level errors. LLMs (MT5-small) showed moderate performance. Our work supports the use of MT models to enhance GEC in low-resource settings, and we validated these results with Bambara, another West African language.

Paper Structure

This paper contains 40 sections, 6 equations, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Rule-based GEC tool workflow
  • Figure 2: Interfaces of the GEC tools, with the rule-based tool on the left and the other approaches on the right