An Evaluation of Sindhi Word Embedding in Semantic Analogies and Downstream Tasks

Wazir Ali, Saifullah Tumrani, Jay Kumar, Tariq Rahim Soomro

TL;DR

This work tackles the scarcity of Sindhi language resources by assembling a large unlabeled corpus (>61 million words) from diverse web sources and training three embedding models on it: GloVe, continuous bag-of-words (CBoW), and skip-gram (SG). Using a purpose-built preprocessing pipeline and a curated Sindhi stop-word list, the authors train the embeddings and evaluate them on both intrinsic and extrinsic tasks, consistently finding that CBoW and SG outperform GloVe and the existing Sindhi fastText embeddings. Intrinsic assessments (nearest neighbors, word-pair relationships, WordSim-347) and extrinsic tasks (POS tagging and NER on SiPOS/SiNER) show that SG-based embeddings capture stronger semantic and syntactic relationships and yield better downstream performance. The study contributes a sizable Sindhi corpus, a stop-word list, and a comparative analysis of embedding methods, with implications for improved Sindhi NLP applications and potential extensions to WordNet construction and contextualized models such as BERT, ELMo, and GPT.

Abstract

In this paper, we propose a new corpus for training word embeddings, consisting of more than 61 million words crawled from multiple web resources. We design a preprocessing pipeline to filter unwanted text from the crawled data. Afterwards, the cleaned vocabulary is fed to state-of-the-art continuous-bag-of-words, skip-gram, and GloVe word embedding algorithms. For the evaluation of the pretrained embeddings, we use popular intrinsic and extrinsic evaluation approaches. The evaluation results reveal that continuous-bag-of-words and skip-gram perform better than GloVe and the existing Sindhi fastText word embeddings on both intrinsic and extrinsic evaluation approaches.

Paper Structure (22 sections, 4 figures, 11 tables)

This paper contains 22 sections, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Visualization of Sindhi CBoW word embeddings
  • Figure 2: Visualization of the Sindhi SG word embeddings
  • Figure 3: Visualization of the Sindhi GloVe word embeddings
  • Figure 4: Visualization of the SdfastText word embeddings