Tenyidie Syllabification corpus creation and deep learning applications

Teisovi Angami, Kevisino Khate

TL;DR

This work tackles syllabification for Tenyidie, a low-resource Tibeto-Burman language, by building a manually annotated corpus of $10{,}120$ words and evaluating several deep learning sequence-labeling approaches. A BLSTM model performs best, reaching $99.21\%$ word-level accuracy on a held-out test set, which demonstrates that data-driven syllabification is feasible for this language. The corpus supports downstream NLP tasks for Tenyidie such as morphological analysis, POS tagging, and machine translation, and it provides linguistic insight into syllable types and root-initial clusters. Future work will expand the dataset and explore encoder-decoder architectures on larger data to improve performance further.

Abstract

Tenyidie is a low-resource language of the Tibeto-Burman family spoken by the Tenyimia community of Nagaland in north-eastern India, where it is considered a major language. It is tonal, follows Subject-Object-Verb word order, and is highly agglutinative. Because it is a low-resource language, very little Natural Language Processing (NLP) research has been conducted on it, and to the best of our knowledge no work on syllabification has been reported. Syllabification (or syllabication), the task of identifying the syllables of a given word, is an important NLP task. The contribution of this work is the creation of a corpus of 10,120 syllabified Tenyidie words and the application of deep learning techniques to that corpus. We apply LSTM, BLSTM, BLSTM+CRF, and encoder-decoder architectures to the dataset. With an 80:10:10 (train:validation:test) split, the BLSTM model achieves the highest accuracy on the test set, 99.21%. This work will find use in other NLP applications for Tenyidie, such as morphological analysis, part-of-speech tagging, and machine translation.

Keywords: Tenyidie; NLP; syllabification; deep learning; LSTM; BLSTM; CRF; Encoder-decoder
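To make the sequence-labeling framing concrete, here is a minimal Python sketch of the data preparation and scoring that such a setup typically involves. It is an illustration under stated assumptions, not the authors' pipeline: the hyphen delimiter, the `B`/`I` label scheme, the function names, and the shuffling seed are all hypothetical, and the example word is a non-Tenyidie placeholder. The neural models themselves are omitted.

```python
import random


def to_labels(syllabified, sep="-"):
    """Turn 'ba-na-na' into characters plus per-character labels.

    Assumed scheme: 'B' marks the first character of a syllable,
    'I' marks every other character.
    """
    chars, labels = [], []
    for syllable in syllabified.split(sep):
        for i, ch in enumerate(syllable):
            chars.append(ch)
            labels.append("B" if i == 0 else "I")
    return chars, labels


def from_labels(chars, labels, sep="-"):
    """Rebuild the syllabified string from characters and labels."""
    out = []
    for ch, lab in zip(chars, labels):
        if lab == "B" and out:  # boundary before every syllable but the first
            out.append(sep)
        out.append(ch)
    return "".join(out)


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/validation/test at 80:10:10."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


def word_accuracy(gold, pred):
    """Fraction of words whose full predicted syllabification matches gold."""
    if not gold:
        raise ValueError("gold must not be empty")
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)
```

Under word-level scoring as sketched here, a word counts as correct only if every boundary in it is predicted correctly, which makes it a stricter measure than per-character label accuracy.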

Paper Structure

This paper contains 10 sections, 19 figures, and 6 tables.

Figures (19)

  • Figure 1: Tenyidie manual syllabification methodology
  • Figure 2: Tenyidie syllabified corpus syllable types vs. frequency
  • Figure 3: Syllable distribution for beginning positioned syllables
  • Figure 4: Syllable distribution for middle positioned syllables
  • Figure 5: Syllable distribution for end positioned syllables
  • ...and 14 more figures