NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

Anuoluwapo Aremu; Jesujoba O. Alabi; Daud Abolade; Nkechinyere F. Aguobi; Shamsuddeen Hassan Muhammad; David Ifeoluwa Adelani

TL;DR

NaijaRC is a new multi-choice reading comprehension dataset for three Nigerian languages (Hausa, Igbo, Yorùbá), derived from high-school examinations. The work establishes cross-lingual baselines by fine-tuning encoder-only models on Belebele-derived data, and it evaluates LLMs (GPT-3.5, GPT-4) through prompting. GPT-4 achieves the best overall NaijaRC performance (51.4%), while language-specific PLMs show strengths across languages, with Serengeti performing best when English is excluded. The results demonstrate both the feasibility and the limitations of current models for reading comprehension in under-resourced African languages, and they point to few-shot adaptation as a promising future direction.

Abstract

In this paper, we create NaijaRC: a new multi-choice Reading Comprehension dataset for three native Nigerian languages, based on high-school reading comprehension examinations. We provide baseline results by performing cross-lingual transfer with a pre-trained encoder-only model, using the existing English RACE and Belebele training datasets. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.
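To make the prompting setup concrete, the following is a minimal sketch of how a multi-choice item might be turned into a prompt and how model replies might be scored. It is not the authors' pipeline: the `RCItem` fields, the prompt wording, the four-option A to D format, and the first-character parsing rule are all assumptions made for illustration, and the sample items are placeholders rather than NaijaRC data.

```python
from dataclasses import dataclass
from typing import List, Optional

LETTERS = "ABCD"  # assumed: four answer options per question


@dataclass
class RCItem:
    """Hypothetical container for one multi-choice reading comprehension item."""
    passage: str
    question: str
    options: List[str]
    answer: int  # index of the correct option


def build_prompt(item: RCItem) -> str:
    """Format an item as a plain-text prompt (the wording is an assumption)."""
    lines = [f"Passage: {item.passage}", f"Question: {item.question}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(item.options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def parse_choice(reply: str) -> Optional[int]:
    """Map a reply to an option index, assuming it starts with the chosen letter."""
    first = reply.strip()[:1].upper()
    if first and first in LETTERS:
        return LETTERS.index(first)
    return None  # unparseable replies are scored as wrong


def accuracy(items: List[RCItem], replies: List[str]) -> float:
    """Fraction of items whose parsed reply matches the gold answer."""
    correct = sum(parse_choice(r) == it.answer for it, r in zip(items, replies))
    return correct / len(items)


# Placeholder items, not taken from the dataset.
items = [
    RCItem("passage one", "question one", ["w", "x", "y", "z"], 1),
    RCItem("passage two", "question two", ["w", "x", "y", "z"], 0),
    RCItem("passage three", "question three", ["w", "x", "y", "z"], 2),
]
replies = ["B", "a.", ""]  # stand-ins for model outputs
score = accuracy(items, replies)
```

In this sketch an empty or malformed reply counts as an incorrect answer, which is one possible convention; a real evaluation would need to state how such replies are handled.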
