One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

Sebastian Nehrdich; Oliver Hellwig; Kurt Keutzer

TL;DR

This paper demonstrates that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

Abstract

Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. The model is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.

Paper Structure (15 sections, 3 figures, 8 tables)

This paper contains 15 sections, 3 figures, 8 tables.

Introduction
Related Research
Data
Fine-tuning Dataset
Proposed Method
Experiments
Evaluation on Previous Sanskrit Word Segmentation Tasks
Vedic Dependency Parsing
Sanskrit OCR Post-correction
Lemmatization and Dependency Parsing on other MLR Languages
Joint Sanskrit Word Segmentation, Lemmatization and Morpho-syntax Tagging Task
Error analysis
Ablation Study
Conclusion and Future Work
Limitations

Figures (3)

Figure 1: Serialization for the morphosyntactic tagging task. The abbreviated tags are highlighted in red. We use spaces as the separator token between words.
Figure 2: Sanskrit Multitask Formulation: All tasks are converted into sequence-generation tasks. For each task, we prepend prompt tokens (S, L, LM, here marked in red) in order to enable the model to distinguish between tasks. For efficient training and inference, we use a novel serialization strategy to compress the morphosyntactic tags into as few characters as possible, here marked in blue.
Figure 3: Results of the detailed error analysis
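
The multitask formulation described in the caption of Figure 2 can be illustrated with a minimal sketch. The prompt tokens S, L, and LM come from the caption; which task each token selects, the space used to join prefix and input, and the names `TASK_PREFIXES`, `build_input`, and `to_bytes` are assumptions made here for illustration and are not taken from the paper. The compressed tag serialization is not reproduced, since its scheme is not given in this section.

```python
from typing import List

# Hypothetical mapping from task name to the prompt tokens named in Figure 2.
# Which task each token stands for is an assumption, not confirmed by the source.
TASK_PREFIXES = {
    "segmentation": "S",
    "lemmatization": "L",
    "lemmatization_morphosyntax": "LM",
}


def build_input(task: str, sentence: str, sep: str = " ") -> str:
    """Prepend the prompt token for `task` so one model can tell tasks apart.

    The separator between prefix and sentence is an assumption.
    """
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return f"{TASK_PREFIXES[task]}{sep}{sentence}"


def to_bytes(text: str) -> List[int]:
    """Return the UTF-8 byte values of `text`, the unit a byte-level model reads."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder input string, not an example from the paper.
    prompt = build_input("segmentation", "tat tvam asi")
    print(prompt)
    print(len(to_bytes(prompt)), "bytes")
```

Because every task becomes plain sequence generation over bytes, the same encoder-decoder can be fine-tuned on all of them at once; only the prefix changes between tasks.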
