A Rule-based Computational Model for Gàidhlig Morphology

Peter J Barclay

TL;DR

This work addresses the scarcity of large annotated corpora for Gàidhlig by proposing a rule-based morphology model derived from Wiktionary data. The approach converts Wiktionary entries into a standardized vocabulary format (SVF), loads them into a relational database, and builds Python utilities that generate inflected forms from a declarative rule base. Early results show that SQL-driven pattern analysis is useful and that lexeme recognition expands when inflected forms are included, which points to applications in teaching materials and rule-based parsers. Overall, the study argues that linguistically grounded, data-efficient rule-based methods can complement neural models for low-resource Celtic languages, enabling interpretable morphology, educational tooling, and scalable repurposing of semi-structured data.
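
To make the database step concrete, the following is a minimal sketch of SQL-driven pattern analysis using Python's built-in sqlite3 module. The table name `lexeme`, its columns, the helper `count_by_ending`, and the handful of sample rows are assumptions made for illustration; they are not the paper's actual SVF schema or data.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexeme (lemma TEXT PRIMARY KEY, pos TEXT, gender TEXT)")
conn.executemany(
    "INSERT INTO lexeme VALUES (?, ?, ?)",
    [
        ("bàta", "noun", "m"),
        ("cat", "noun", "m"),
        ("balach", "noun", "m"),
        ("sgoil", "noun", "f"),
        ("mòr", "adj", None),
    ],
)


def count_by_ending(conn, suffix):
    """Count lemmas per part of speech whose spelling ends in `suffix`."""
    rows = conn.execute(
        "SELECT pos, COUNT(*) FROM lexeme "
        "WHERE lemma LIKE ? GROUP BY pos ORDER BY pos",
        ("%" + suffix,),
    )
    return dict(rows.fetchall())


# How many lemmas of each part of speech end in "-ach"?
ach_counts = count_by_ending(conn, "ach")
print(ach_counts)
```

Queries of this kind could be used to check how often a spelling pattern occurs before writing a rule for it.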

Abstract

Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which is not normally available for such low-resource languages. This paper describes work in progress to construct a rule-based model of Gàidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule base is presented that allows Python utilities to derive inflected forms of Gàidhlig words. This functionality could be used, for example, to support educational tools that teach or explain language patterns, or to support higher-level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use cases.
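
As a rough illustration of what a declarative rule base might look like, the sketch below stores each rule as plain data (a list of named steps) and applies it with a small interpreter. The names `RULES`, `OPERATIONS`, `derive`, and the two rules shown are hypothetical and are not taken from the paper. The lenition function covers only the basic orthographic case (an "h" after an initial b, c, d, f, g, m, p, s, or t, with sg-, sm-, sp-, and st- left unchanged), and the "-ichean" plural applies only to some nouns.

```python
# Hypothetical declarative rule base: rules are data, not code, so they
# can be listed, inspected, or explained to learners.

LENITABLE = set("bcdfgmpst")   # initial consonants that show lenition in spelling
S_BLOCKERS = set("gmpt")       # sg-, sm-, sp-, st- are not lenited


def lenite(word):
    """Insert 'h' after a lenitable initial consonant; otherwise return word."""
    first = word[:1].lower()
    second = word[1:2].lower()
    if first not in LENITABLE:
        return word
    if first == "s" and second in S_BLOCKERS:
        return word
    if second == "h":          # already lenited
        return word
    return word[0] + "h" + word[1:]


def suffix(word, ending):
    """Append an ending to the word unchanged."""
    return word + ending


# Operation names that rules may refer to.
OPERATIONS = {"lenite": lenite, "suffix": suffix}

# Each rule is a list of steps; each step is (operation name, *arguments).
RULES = {
    "lenited": [("lenite",)],
    "plural_ichean": [("suffix", "ichean")],   # applies to some nouns only
}


def derive(lemma, rule_name):
    """Apply the named rule's steps to a lemma, in order."""
    form = lemma
    for op_name, *args in RULES[rule_name]:
        form = OPERATIONS[op_name](form, *args)
    return form


print(derive("bàta", "lenited"))         # bhàta
print(derive("bàta", "plural_ichean"))   # bàtaichean
print(derive("sgoil", "lenited"))        # sgoil (sg- is not lenited)
```

Keeping the rules as data means a teaching tool could display which rule produced a given form, which is one sense in which such a model is interpretable.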

Paper Structure (20 sections, 1 equation, 3 figures, 3 tables)

This paper contains 20 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Data preparation pipeline.
  • Figure 2: Wiktionary JSON entry for bàta (a boat).
  • Figure 3: Text coverage of first 15 stopwords.