Robust Language Identification for Romansh Varieties

Charlotte Model; Sina Ahmadi; Jannis Vamvas

Abstract

The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Because a Romansh LID system should also recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, the task poses a novel and interesting classification problem. In this paper, we present an LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.

Paper Structure (20 sections, 1 figure, 6 tables)

Introduction
Romansh and its Varieties
Related Work
Language Identification
Romansh NLP
Data
Preprocessing
Named-Entity Masking
Experimental Setup
Data Splits
Classification
Hyperparameter Optimization
Results
Overall Results
Per Idiom Performances
...and 5 more sections

Figures (1)

Figure 1: Row-normalized confusion matrices for all test sets. The model achieves near-perfect classification on the balanced in-domain set (test-b), while confusion increases on out-of-domain data (test-d).
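For readers unfamiliar with the term, row normalization divides each row of a confusion matrix by the number of gold items for that label, so every row sums to 1 and the diagonal can be read as per-label recall. The following is a minimal sketch in plain Python; the function name, the label names, and the toy data are assumptions made for illustration and are not taken from the paper or its benchmark.

```python
from collections import Counter


def row_normalized_confusion(gold, pred, labels):
    """Return a matrix whose entry [i][j] is the share of items with
    gold label labels[i] that were predicted as labels[j]."""
    counts = Counter(zip(gold, pred))
    matrix = []
    for g in labels:
        row = [counts[(g, p)] for p in labels]
        total = sum(row)
        # Leave a row of zeros if a label never occurs in the gold data.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy data, invented for illustration only (hypothetical label names).
labels = ["idiom_a", "idiom_b"]
gold = ["idiom_a", "idiom_a", "idiom_a", "idiom_a", "idiom_b", "idiom_b"]
pred = ["idiom_a", "idiom_a", "idiom_a", "idiom_b", "idiom_b", "idiom_b"]

matrix = row_normalized_confusion(gold, pred, labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Because each row is scaled independently, the matrices for test sets of different sizes or label distributions can be compared side by side.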
