Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages

Joanito Agili Lopo, Radius Tanone

TL;DR

The paper tackles the scarcity of NLP resources for Indonesia's local languages by constructing Bhinneka Korpus, a multilingual parallel corpus for Ambonese Malay, Beaye, Kupang Malay, Makassarese, and Uab Meto. It uses a participatory translation-circle approach, combining Tatoeba and NusaX Lexicon data with double-blind quality checks, to assemble 18,000 sentences and a first Beaye bilingual lexicon. A statistical machine translation (SMT) baseline using IBM Model 1 with several smoothing settings shows that translation performance differs across languages, with Beaye giving the most stable BLEU scores and Uab Meto proving the most challenging. The work provides insights into lexical phenomena, readability, and diversity in low-resource Indonesian languages and releases open resources to accelerate future multilingual NLP research in Indonesia.

Abstract

In Indonesia, local languages play an integral role in the culture. However, the available Indonesian language resources are still classed as limited data in the field of Natural Language Processing (NLP). This becomes problematic when building NLP models for these languages. To address this gap, we introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages. Our goal is to enhance access to and use of these resources, extending their reach within the country. We explain in detail the dataset collection process and the associated challenges. Additionally, because of data constraints, we experiment with a translation task using IBM Model 1. The results show that performance for each language is already promising enough to justify further development. We discuss challenges such as lexical variation, smoothing effects, and cross-linguistic variability. We intend to evaluate the corpus using advanced NLP techniques for low-resource languages, paving the way for multilingual translation models.
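
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of IBM Model 1 training with expectation-maximization. It is not the authors' implementation: the function name `train_ibm_model1`, the three toy Indonesian-English sentence pairs, and the iteration count are assumptions made for illustration, the NULL source word is omitted for brevity, and no data from Bhinneka Korpus is used. Smoothing and BLEU scoring are also left out because this section does not specify how they were configured.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate lexical translation probabilities P(tgt_word | src_word).

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    Returns a dict keyed by (tgt_word, src_word).
    Simplification (assumption): no NULL source word is added.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (tgt_word, src_word)
        total = defaultdict(float)  # expected count of src_word

        # E-step: spread each target word over the source words of its sentence.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: renormalize expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]

    return t


# Toy data for illustration only (hypothetical, not taken from the corpus).
toy_pairs = [
    (["rumah", "itu"], ["that", "house"]),
    (["buku", "itu"], ["that", "book"]),
    (["sebuah", "buku"], ["a", "book"]),
]

t = train_ibm_model1(toy_pairs, iterations=10)
for tgt_word in ["book", "that", "a"]:
    print(tgt_word, "| buku:", round(t[(tgt_word, "buku")], 3))
```

Because Model 1 relies only on word co-occurrence within sentence pairs and ignores word order, it can be trained on a few thousand sentences per language, which is consistent with the abstract's remark that it was chosen due to data constraints.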

Paper Structure (23 sections, 3 figures, 4 tables)

This paper contains 23 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Number of open datasets available for Indonesian local languages
  • Figure 2: Taxonomy of the languages
  • Figure 3: Heatmap of variations in lexical diversity