Implicit Discourse Relation Classification For Nigerian Pidgin

Muhammed Saeed, Peter Bourgonje, Vera Demberg

TL;DR

This paper systematically compares two approaches to implicit discourse relation classification (IDRC) for Nigerian Pidgin (NP): translating NP data to English, applying a well-resourced English IDRC tool, and back-projecting the labels, versus creating a synthetic NP discourse corpus by translating the PDTB and projecting its labels, then training a native NP IDR classifier.

Abstract

Despite attempts to make Large Language Models multi-lingual, many of the world's languages are still severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-resourced languages and those aimed at less-resourced ones. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people but has comparatively very few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare an approach translating NP data to English, using a well-resourced English IDRC tool, and back-projecting the labels versus creating a synthetic discourse corpus for NP, in which we translate the PDTB, project PDTB labels, and then train an NP IDR classifier. The latter approach of learning a "native" NP classifier outperforms our baseline by 13.27\% and 33.98\% in f$_{1}$ score for 4-way and 11-way classification, respectively.
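The two compared pipelines can be sketched in Python. All function names below (`translate_np_to_en`, `english_idrc`, `translate_and_project`, and the stub behaviors) are illustrative placeholders, not the authors' actual models or code; real systems would plug in an NP-English MT model and a trained IDRC classifier.

```python
# Hypothetical sketch of the two pipelines compared in the paper.
# Every function here is a stand-in stub for illustration only.

def translate_np_to_en(np_text):
    # Stub for an NP -> English machine translation system.
    return f"EN({np_text})"

def english_idrc(en_arg1, en_arg2):
    # Stub for a well-resourced English IDRC model; always predicts
    # one fixed sense here, purely to make the sketch runnable.
    return "Contingency"

def baseline_classify(np_arg1, np_arg2):
    # Baseline: translate NP arguments to English, classify with the
    # English tool, and back-project the predicted label onto NP.
    return english_idrc(translate_np_to_en(np_arg1),
                        translate_np_to_en(np_arg2))

def translate_and_project(pdtb_corpus):
    # "Native" approach, step 1: translate PDTB arguments to NP while
    # carrying each English sense label over to the NP side.
    return [(f"NP({a1})", f"NP({a2})", label)
            for a1, a2, label in pdtb_corpus]

# Step 2 (not shown): train an NP IDR classifier on this synthetic corpus.
synthetic_np_pdtb = translate_and_project([("arg1", "arg2", "Comparison")])
```

The key design difference is where the supervision lives: the baseline leaves the classifier in English and pays a translation cost at inference time, while the native approach pays the translation cost once, at corpus-construction time, and then classifies NP text directly.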

Paper Structure

This paper contains 32 sections, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: PDTB2.0 Sense Hierarchy
  • Figure 2: Test-set Relation Sense Distribution.
  • Figure 3: In the fine-tuning approach, the upper part illustrates the relation-based (RB) method with parallel English and NP sentences to align and extract arguments in NP. The lower part illustrates the argument-based (AB) method, directly translating continuous arguments and using alignment (essentially, the RB method) for (a relatively small number of) discontinuous ones. The AB method relies less on alignment, resulting in fewer lost arguments/relations, hence the NP PDTB AB dataset is larger than the NP PDTB RB dataset.
  • Figure 4: Illustration of the four different approaches outlined in Sections \ref{sec:EnglishIDRConNP} and \ref{sec:FinetuningusingNPPDTB}.
  • Figure 5: AWESoME+PFT NP PDTB relation sense distribution on top-level (top) and second-level (bottom).