
Classifying Graphemes in English Words Through the Application of a Fuzzy Inference System

Samuel Rose, Chandrasekhar Kambhampati

TL;DR

The paper addresses grapheme segmentation in English by proposing a fuzzy inference system (FIS) that predicts a word's grapheme count from its length, vowel count, and consonant count. It constructs fuzzy sets from observed grapheme-count statistics and uses Gaussian membership functions with centroid defuzzification within a Mamdani framework to output predicted grapheme counts in the range $[1,14]$, which are then used to map words into graphemes. Compared with IPA mapping, the FIS approach is less dependent on dictionaries and thus more robust to spelling irregularities. It shows lower raw accuracy in predicting the exact grapheme count on training data, but comparable or better alignment for dialect-specific mappings (e.g., British RP). Overall, the study demonstrates that grapheme-count trends can be effectively approximated with fuzzy logic, offering a viable method for phonological word analysis in NLP and NLG applications, while highlighting dialectal variation as a key factor in precision.
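To make the pipeline described above concrete, the following is a minimal sketch of a Mamdani-style system with Gaussian membership functions and centroid defuzzification over the output range $[1,14]$. It is not the authors' implementation: the set labels (`low`, `mid`, `high`), every mean and sigma value, the three rules, and the helper names `word_features` and `predict_grapheme_count` are illustrative assumptions, whereas the paper derives its sets from observed grapheme-count statistics.

```python
import numpy as np


def gaussian(x, mean, sigma):
    """Gaussian membership function; works on scalars and arrays."""
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)


# Output universe of discourse: grapheme counts in [1, 14].
UNIVERSE = np.linspace(1.0, 14.0, 1301)

# Hypothetical (mean, sigma) pairs chosen for illustration only;
# they are NOT the parameters fitted in the paper.
SETS = {
    "length":     {"low": (3.0, 1.5), "mid": (7.0, 2.0), "high": (12.0, 3.0)},
    "vowels":     {"low": (1.0, 1.0), "mid": (3.0, 1.0), "high": (5.0, 1.5)},
    "consonants": {"low": (2.0, 1.5), "mid": (4.0, 1.5), "high": (8.0, 2.5)},
    "graphemes":  {"low": (3.0, 1.0), "mid": (6.0, 1.5), "high": (10.0, 2.0)},
}


def word_features(word):
    """Crisp inputs: letter count, vowel count, consonant count."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    vowels = sum(ch in "aeiou" for ch in letters)  # assumption: 'y' is a consonant
    return {
        "length": len(letters),
        "vowels": vowels,
        "consonants": len(letters) - vowels,
    }


def predict_grapheme_count(word):
    """Mamdani inference with min for AND, max aggregation, centroid output."""
    features = word_features(word)
    aggregated = np.zeros_like(UNIVERSE)
    # Assumed rule base: IF all three inputs are <label> THEN graphemes is <label>.
    for label in ("low", "mid", "high"):
        strength = min(
            gaussian(value, *SETS[name][label]) for name, value in features.items()
        )
        consequent = gaussian(UNIVERSE, *SETS["graphemes"][label])
        # Clip the consequent at the rule's firing strength, then aggregate.
        aggregated = np.maximum(aggregated, np.minimum(strength, consequent))
    # Discrete centroid of the aggregated output set.
    return float(np.sum(UNIVERSE * aggregated) / np.sum(aggregated))


if __name__ == "__main__":
    for w in ("cat", "through", "straightforward"):
        print(w, round(predict_grapheme_count(w), 2))
```

Because the centroid is taken over a universe bounded by 1 and 14, the crisp output always stays inside that range; how the paper converts this value into an integer grapheme count is not specified in this summary.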

Abstract

In linguistics, a grapheme is a written unit of a writing system that corresponds to a phonological sound. In Natural Language Processing tasks, written language is typically analysed at two levels: words and characters. This paper focuses on a third approach, the analysis of graphemes. Graphemes have an advantage over words and characters in that they are self-contained representations of phonetic sounds. Because splitting a word into graphemes depends on complex, non-binary rules, fuzzy logic provides a suitable medium for predicting the number of graphemes in a word. This paper proposes a Fuzzy Inference System to split words into their graphemes. The system predicts the number of graphemes in a word correctly 50.18% of the time, and 93.51% of its predictions fall within ±1 of the correct count. Because graphemes are tied to pronunciation, they can change with regional accent or dialect; the ±1 accuracy reflects the imprecision of grapheme classification once regional variation is accounted for. As a baseline for comparison, a second method was developed that performs a recursive IPA mapping using a pronunciation dictionary.
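The two figures quoted in the abstract (exact matches and matches within ±1) can be read as the same metric evaluated at two tolerances. The sketch below shows one way to compute it; the function name `count_accuracy`, the rounding of predictions to the nearest integer, and the sample numbers are assumptions for illustration and are not taken from the paper.

```python
def count_accuracy(predicted, actual, tolerance=0):
    """Fraction of predictions within `tolerance` of the true grapheme count.

    Predictions are rounded to the nearest integer first (an assumption;
    the paper's own rounding rule is not stated in this summary).
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    hits = sum(abs(round(p) - a) <= tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)


if __name__ == "__main__":
    # Made-up example values, not results from the paper.
    predicted = [3.2, 4.1, 6.3, 2.2]
    actual = [3, 5, 6, 5]
    print("exact:", count_accuracy(predicted, actual))
    print("within 1:", count_accuracy(predicted, actual, tolerance=1))
```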

Paper Structure

This paper contains 13 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Probabilities of Grapheme count compared to word length
  • Figure 2: Frequency Sets for Grapheme count compared to word length
  • Figure 3: Membership Functions for Grapheme count compared to word length
  • Figure 4: Membership Functions for Grapheme count compared to a word's vowel count
  • Figure 5: Membership Functions for Grapheme count compared to a word's consonant count
  • ...and 2 more figures