Neural Compound-Word (Sandhi) Generation and Splitting in Sanskrit Language

Sushant Dave; Arun Kumar Singh; Prathosh A. P.; Brejesh Lall

TL;DR

The paper tackles Sandhi generation and Sandhi splitting in Sanskrit using fully data-driven sequence-to-sequence neural models, without external lexical resources. It frames Sandhi generation as a translation-like task and decomposes Sandhi splitting into two stages: first identifying the sandhi-window, then splitting it with a seq2seq model, aided by a truncation strategy ($n=5$, $m=2$). Evaluated on the UoH Sanskrit corpus, the approach outperforms several public baselines and other neural architectures, while remaining simpler and faster because it uses no attention mechanism. The code is open source, and the results suggest the approach can scale to Sanskrit morphological processing and downstream NLP tasks. Future work includes extending the method to internal Sandhi and improving data quality.
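The truncation strategy mentioned above can be pictured with a minimal sketch. It rests on assumptions, since this summary does not define $n$ and $m$: here $n$ is taken to be the number of trailing characters kept from the first word and $m$ the number of leading characters kept from the second. The function names, the `+` separator, and the romanized example words are hypothetical and chosen only for illustration; they are not taken from the authors' code.

```python
def truncate_pair(first: str, second: str, n: int = 5, m: int = 2) -> tuple[str, str]:
    """Keep only the characters nearest the word boundary.

    Assumption: n counts trailing characters of the first word and
    m counts leading characters of the second word.
    """
    # Slicing is safe for words shorter than n or m: the whole word is kept.
    return first[-n:], second[:m]


def make_source(first: str, second: str, n: int = 5, m: int = 2, sep: str = "+") -> str:
    """Build a hypothetical seq2seq input string from a truncated word pair."""
    left, right = truncate_pair(first, second, n, m)
    return f"{left}{sep}{right}"


if __name__ == "__main__":
    # Illustrative romanized words, one character per sound.
    print(truncate_pair("rAmAyaRa", "api"))
    print(make_source("rAmAyaRa", "api"))
```

The idea behind such a window is that the transformation at a word boundary depends mainly on the characters adjacent to it, so shorter sequences may be enough for the model while keeping training cheaper.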

Abstract

This paper describes neural-network-based approaches to forming and splitting compound words in the Sanskrit language, processes known respectively as Sandhi and Vichchhed. Sandhi is essential to the morphological analysis of Sanskrit texts; it transforms words at word boundaries. The rules of Sandhi formation are well defined but complex, sometimes optional, and in some cases require knowledge of the nature of the words being compounded. Sandhi split, or Vichchhed, is an even more difficult task because it is non-unique and context-dependent. In this work, we formulate the problem as a sequence-to-sequence prediction task using modern deep learning techniques. Ours is the first fully data-driven technique, and we demonstrate that our model is more accurate than existing methods on multiple standard datasets, despite not using any additional lexical or morphological resources. The code is being made available at https://github.com/IITD-DataScience/Sandhi_Prakarana
