Shona spaCy: A Morphological Analyzer for an Under-Resourced Bantu Language

Happymore Masoka

TL;DR

Shona spaCy addresses the under-representation of Shona in NLP with a hybrid rule-based morphological analyzer integrated into spaCy. It combines a lexicon-driven annotation pipeline with explicit linguistic rules for noun class prefixes, verbal morphology, clitics, and ideophones, producing interpretable token-level morphosyntactic annotations. On formal and informal Shona data, the system reaches 90.7% POS-tagging accuracy and 88.3% morphological-feature accuracy, with strong lexical and rule coverage. This open-source tool advances digital inclusion for Shona speakers and offers a template for morphological analysis in other under-resourced Bantu languages.

Abstract

Despite rapid advances in multilingual natural language processing (NLP), the Bantu language Shona remains under-served in terms of morphological analysis and language-aware tools. This paper presents Shona spaCy, an open-source, rule-based morphological pipeline for Shona built on the spaCy framework. The system combines a curated JSON lexicon with linguistically grounded rules to model noun-class prefixes (Mupanda 1-18), verbal subject concords, tense-aspect markers, ideophones, and clitics, integrating these into token-level annotations for lemma, part-of-speech, and morphological features. The toolkit is available via pip install shona-spacy, with source code at https://github.com/HappymoreMasoka/shona-spacy and a PyPI release at https://pypi.org/project/shona-spacy/0.1.4/. Evaluation on formal and informal Shona corpora yields 90% POS-tagging accuracy and 88% morphological-feature accuracy, while maintaining transparency in its linguistic decisions. By bridging descriptive grammar and computational implementation, Shona spaCy advances NLP accessibility and digital inclusion for Shona speakers and provides a template for morphological analysis tools for other under-resourced Bantu languages.
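The lexicon-plus-rules design described in the abstract can be illustrated with a minimal sketch. This is not the shona-spacy implementation or its API: the function names, the dictionary layout, and the tiny inventories of prefixes, concords, and stems below are assumptions chosen for illustration, and the sketch deliberately avoids spaCy so that it runs with the standard library alone. It shows one plausible ordering (lexicon lookup first, then a verbal rule, then a noun-class prefix rule) that yields a lemma, a part-of-speech tag, and morphological features per token.

```python
from typing import Optional

# Hypothetical toy lexicon. The real toolkit uses a curated JSON lexicon
# whose schema is not shown in the source; this layout is an assumption.
LEXICON = {
    "munhu": {"lemma": "munhu", "pos": "NOUN", "morph": {"NounClass": "1"}},
    "chingwa": {"lemma": "chingwa", "pos": "NOUN", "morph": {"NounClass": "7"}},
}

# Illustrative subsets only; a full analyzer would cover classes 1-18.
NOUN_PREFIXES = [("zvi", "8"), ("chi", "7"), ("va", "2")]  # longest first
SUBJECT_CONCORDS = {
    "ndi": {"Person": "1", "Number": "Sing"},
    "ti": {"Person": "1", "Number": "Plur"},
    "va": {"Person": "3", "Number": "Plur"},
}
TENSE_MARKERS = {"cha": "Fut", "ka": "Past"}
VERB_STEMS = {"enda", "famba", "taura"}  # go, walk, speak


def analyze_verb(token: str) -> Optional[dict]:
    """Try to split a token as subject concord + tense marker + known stem."""
    for concord, feats in SUBJECT_CONCORDS.items():
        if not token.startswith(concord):
            continue
        rest = token[len(concord):]
        for marker, tense in TENSE_MARKERS.items():
            stem = rest[len(marker):]
            if rest.startswith(marker) and stem in VERB_STEMS:
                return {"lemma": stem, "pos": "VERB",
                        "morph": {**feats, "Tense": tense}}
    return None


def analyze_noun(token: str) -> Optional[dict]:
    """Guess a noun class from the prefix; the lemma is left as the surface form."""
    for prefix, noun_class in NOUN_PREFIXES:
        if token.startswith(prefix) and len(token) > len(prefix):
            return {"lemma": token, "pos": "NOUN",
                    "morph": {"NounClass": noun_class}}
    return None


def analyze(token: str) -> dict:
    """Lexicon first, then the verb rule, then the noun rule, else unknown."""
    token = token.lower()
    if token in LEXICON:
        return LEXICON[token]
    return (analyze_verb(token)
            or analyze_noun(token)
            or {"lemma": token, "pos": "X", "morph": {}})


if __name__ == "__main__":
    for word in ["munhu", "ndichaenda", "vakafamba", "zvingwa"]:
        print(word, analyze(word))
```

Trying the verb rule before the noun rule matters in this sketch because a prefix such as va- is both a subject concord and a noun-class prefix; requiring a known stem keeps the verbal reading from firing on nouns. Because every decision traces back to a lexicon entry or a named rule, the output stays inspectable, which is the transparency property the abstract emphasizes.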

Paper Structure

This paper contains 33 sections, 1 equation, 2 tables.