Modeling Orthographic Variation Improves NLP Performance for Nigerian Pidgin

Pin-Jie Lin, Merel Scholman, Muhammed Saeed, Vera Demberg

TL;DR

This work addresses the lack of a standardized orthography in Nigerian Pidgin by analyzing common orthographic variation and introducing a phonology-driven framework for synthesizing word variants. The approach generates plausible spelling variants using phonological distance (PWLD) and adds them to the training data for sentiment analysis and machine translation. Augmenting the training data with real text from other corpora together with synthesized variants improves sentiment analysis by 2.1 F1 points and translation to English by 1.4 BLEU points, and it also helps cross-corpus generalization (training on JW300, testing on Naija Treebank). The contributions are the first systematic typology of orthographic variation in Nigerian Pidgin, a reproducible augmentation pipeline, and evidence from two tasks that synthetic orthographic variation can improve NLP for under-resourced pidgin languages.

Abstract

Nigerian Pidgin is an English-derived contact language and is traditionally an oral language, spoken by approximately 100 million people. No orthographic standard has yet been adopted, and thus the few available Pidgin datasets are characterised by noise in the form of orthographic variation. This contributes to under-performance of models in critical NLP tasks. The current work is the first to describe the types of orthographic variation commonly found in Nigerian Pidgin texts and to model this variation. The variations identified in the dataset form the basis of a phonetic-theoretic framework for word editing, which is used to generate orthographic variants to augment training data. We test the effect of this data augmentation on two critical NLP tasks: machine translation and sentiment analysis. The proposed variation generation framework augments the training data with new orthographic variants that are relevant for the test set but did not originally occur in the training set. Our results demonstrate the positive effect of augmenting the training data with a combination of real texts from other corpora and synthesized orthographic variation, resulting in performance improvements of 2.1 points in sentiment analysis and 1.4 BLEU points in translation to English.

Paper Structure (35 sections, 3 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Diagram of the orthographic variation generation process used to enrich the text corpus. The seed words are transcribed into phoneme sequences (Step 1), and character-phoneme pairs are aligned (Step 2). Next, variant candidates are generated by applying the rules (Step 3). Finally, we measure the phonological distance between each seed word and its heuristically generated candidates, computed over their phoneme sequences and denoted $d(t^{w_{i}},t^{w'_{i}})$ (Step 4).
  • Figure 2: Performance for different augmented sample sizes $K$. Error bars reflect the standard error over six runs.
  • Figure 3: Number of new variants from variation-enhanced augmentation data.
  • Figure 4: Generalization from JW300 (training) to Naija Treebank (testing). We show the improvements in BLEU scores while varying the number of added orthographic variants.
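
The four steps in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rewrite rules, the toy grapheme-to-phoneme table, the substitution costs, and the threshold `max_distance` are all hypothetical values chosen for illustration, and the character-phoneme alignment of Step 2 is skipped because the toy transcriber works directly on spellings. The distance is a plain weighted Levenshtein distance over phoneme sequences, used here as a stand-in for the phonological distance $d(t^{w_{i}},t^{w'_{i}})$.

```python
# Hypothetical grapheme rewrite rules (illustrative only, not the paper's rule set).
RULES = {"th": ["d", "t"], "er": ["a"], "ing": ["in"]}

# Toy grapheme-to-phoneme table standing in for a real transcriber (Step 1).
G2P = {"th": "ð", "ng": "ŋ"}

# Assumed substitution costs for similar phoneme pairs; unlisted pairs cost 1.0.
SUB_COST = {
    frozenset(("ð", "d")): 0.2,
    frozenset(("ð", "t")): 0.4,
    frozenset(("ŋ", "n")): 0.2,
}


def to_phonemes(word):
    """Greedy toy transcription: known digraphs become one phoneme, other letters map to themselves."""
    phones, i = [], 0
    while i < len(word):
        if word[i:i + 2] in G2P:
            phones.append(G2P[word[i:i + 2]])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones


def generate_candidates(word):
    """Apply each rule at its first match to propose spelling variants (Step 3)."""
    candidates = set()
    for pattern, replacements in RULES.items():
        if pattern in word:
            for repl in replacements:
                candidates.add(word.replace(pattern, repl, 1))
    candidates.discard(word)
    return candidates


def sub_cost(a, b):
    """Cost of substituting phoneme a with phoneme b."""
    if a == b:
        return 0.0
    return SUB_COST.get(frozenset((a, b)), 1.0)


def weighted_levenshtein(s, t):
    """Edit distance over phoneme sequences with phoneme-dependent substitution costs (Step 4)."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        curr = [float(i)]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1.0,                   # delete a
                            curr[j - 1] + 1.0,               # insert b
                            prev[j - 1] + sub_cost(a, b)))   # substitute a with b
        prev = curr
    return prev[-1]


def plausible_variants(word, max_distance=0.5):
    """Keep only candidates whose phoneme sequence stays close to the seed word's."""
    source = to_phonemes(word)
    scored = {c: weighted_levenshtein(source, to_phonemes(c))
              for c in generate_candidates(word)}
    return {c: d for c, d in scored.items() if d <= max_distance}
```

With these toy settings, `plausible_variants("brother")` keeps "broder" and "broter", whose phoneme sequences differ from the seed by one cheap substitution, and rejects "brotha", which needs two full-cost edits. In a real pipeline the threshold and cost table would have to be tuned on attested Pidgin spellings.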