Using Contextual Information for Sentence-level Morpheme Segmentation

Prabin Bhandari, Abhishek Paudel

TL;DR

This work reframes morpheme segmentation as a sentence-level sequence-to-sequence task: a Transformer receives each whole sentence as a single training example, so segmentation can draw on context beyond the isolated word. The system uses SentencePiece tokenization with subword regularization, a Transformer with 6 encoder/decoder layers, and an entmax loss, and it is explored in both monolingual and multilingual setups with data augmentation and upsampling. The multilingual model generally outperforms the monolingual models for Czech and Mongolian, and English results remain strong. The approach does not surpass current state-of-the-art scores: results are comparable for high-resource languages but reveal limitations in low-resource settings. Future work includes semi-supervised data expansion and language identifier tokens to further improve multilingual performance.

Abstract

Recent advancements in morpheme segmentation primarily emphasize word-level segmentation, often neglecting the contextual relevance within the sentence. In this study, we redefine the morpheme segmentation task as a sequence-to-sequence problem, treating the entire sentence as input rather than isolating individual words. Our findings reveal that the multilingual model consistently exhibits superior performance compared to monolingual counterparts. While our model did not surpass the performance of the current state-of-the-art, it demonstrated comparable efficacy with high-resource languages while revealing limitations in low-resource language scenarios.
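
To make the sentence-level framing concrete, the following minimal sketch shows how one sentence could be turned into a single source/target pair for a sequence-to-sequence model. It is an illustration under stated assumptions, not the authors' preprocessing code: the " @@" morpheme separator, the helper names `make_sentence_example` and `split_target`, and the sample sentence are all hypothetical.

```python
from typing import List, Tuple

# Assumed marker placed before every non-initial morpheme of a word.
# The actual separator used in the paper's data is not stated in this text.
MORPH_SEP = " @@"


def make_sentence_example(
    words: List[str], segmentations: List[List[str]]
) -> Tuple[str, str]:
    """Build one (source, target) pair from a whole sentence.

    `words` holds the surface words; `segmentations` holds the morphemes
    of each word, in the same order.
    """
    if len(words) != len(segmentations):
        raise ValueError("each word needs exactly one segmentation")
    source = " ".join(words)
    target = " ".join(MORPH_SEP.join(morphs) for morphs in segmentations)
    return source, target


def split_target(target: str) -> List[List[str]]:
    """Recover per-word morpheme lists from a target string."""
    result: List[List[str]] = []
    for token in target.split():
        if token.startswith("@@") and result:
            # Continuation morpheme: attach it to the current word.
            result[-1].append(token[2:])
        else:
            # Token without the marker starts a new word.
            result.append([token])
    return result


words = ["the", "dogs", "barked"]
segmentations = [["the"], ["dog", "s"], ["bark", "ed"]]
source, target = make_sentence_example(words, segmentations)
print(source)
print(target)
```

Because the whole sentence is one example, the model sees neighbouring words while segmenting each word, which is the contextual signal the word-level formulation lacks. `split_target` inverts the encoding so predictions can be compared word by word against a reference.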

Paper Structure (15 sections, 3 tables)