Enhancing Sindhi Word Segmentation using Subword Representation Learning and Position-aware Self-attention

Wazir Ali, Jay Kumar, Saifullah Tumrani, Redhwan Nour, Adeeb Noor, Zenglin Xu

TL;DR

This paper proposes a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task and incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field.

Abstract

Sindhi word segmentation is a challenging task due to space omission and insertion issues. The Sindhi script itself adds to this complexity: it is cursive, and its characters have inherent joining and non-joining properties that are independent of word boundaries. Existing Sindhi word segmentation methods rely on designing and combining hand-crafted features. However, these methods have limitations, such as difficulty handling out-of-vocabulary words, limited robustness for other languages, and inefficiency with large amounts of noisy or raw text. Neural network-based models, in contrast, can automatically capture word boundary information without requiring prior knowledge. In this paper, we propose a Subword-Guided Neural Word Segmenter (SGNWS) that addresses word segmentation as a sequence labeling task. The SGNWS model incorporates subword representation learning through a bidirectional long short-term memory encoder, position-aware self-attention, and a conditional random field. Our empirical results demonstrate that the SGNWS model achieves state-of-the-art performance in Sindhi word segmentation on six datasets.
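
To make the "word segmentation as sequence labeling" framing concrete, here is a minimal sketch that turns a segmented sentence into per-character boundary tags and back. The B/I/E/S tag set and the function names `words_to_tags` and `tags_to_words` are assumptions chosen for illustration; this section does not state which tag scheme the paper uses, and Latin placeholder strings stand in for Sindhi text.

```python
def words_to_tags(words):
    """Map a list of segmented words to per-character boundary tags.

    Tag scheme (an assumption for illustration): B = begin, I = inside,
    E = end, S = single-character word.
    """
    chars, tags = [], []
    for word in words:
        if not word:
            continue  # skip empty tokens
        chars.extend(word)
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return chars, tags


def tags_to_words(chars, tags):
    """Recover words from characters and (predicted) boundary tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words
```

Under this framing, a segmenter only has to predict one tag per character; word boundaries then follow from the tags, regardless of whether spaces were omitted or inserted in the raw text.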

Paper Structure

This paper contains 24 sections, 13 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: The architecture of the proposed SGNWS model. The input $x_{1}, x_{2}, \dots, x_{n}$ is converted into a sequence of character-level subword representations. The BiLSTM network learns subword features for identifying word boundaries. The output of the BiLSTM encoder layer is fed into a self-attention layer before decoding. The notation Q=K=V=H signifies that the Query, Key, and Value vectors used in the attention mechanism are all derived from the same hidden layer (H), obtained from the output of the BiLSTM encoder layer. Finally, we employ a CRF to obtain the predicted label sequence.
  • Figure 2: The training accuracy and loss of baseline and proposed SGNWS models on the SDSEG dataset.
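
The Q=K=V=H notation in the Figure 1 caption can be illustrated with a small sketch of scaled dot-product self-attention in plain Python. This is not the paper's implementation: the function name `self_attention` is hypothetical, the scaling by the square root of the hidden size is the standard choice and is assumed here, and the learned projections and the position-aware component of the paper's attention layer are omitted because this section does not define them.

```python
import math


def self_attention(H):
    """Scaled dot-product self-attention with Q = K = V = H.

    H is a list of n hidden vectors (each a list of d floats), standing in
    for BiLSTM outputs. No learned projections or position terms are used
    (a simplifying assumption).
    """
    d = len(H[0])
    out = []
    for q in H:
        # Similarity of this position to every position, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in H]
        # Numerically stable softmax over positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors (here, the rows of H).
        out.append([sum(w * v[j] for w, v in zip(weights, H)) for j in range(d)])
    return out
```

Each output vector is a convex combination of the rows of H, so every position can draw on context from the whole sequence before the CRF decodes the label sequence.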