NaijaRC: A Multi-choice Reading Comprehension Dataset for Nigerian Languages

Anuoluwapo Aremu, Jesujoba O. Alabi, Daud Abolade, Nkechinyere F. Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani

TL;DR

NaijaRC introduces a new multi-choice reading comprehension dataset for three Nigerian languages (Hausa, Igbo, Yorùbá) derived from high-school examinations. It establishes cross-lingual baselines by fine-tuning encoder-only models on Belebele-derived data and by prompting LLMs (GPT-3.5, GPT-4). Results show GPT-4 achieves the best overall NaijaRC performance (51.4%), while language-specific PLMs exhibit strengths across languages, with Serengeti excelling when English is excluded. The work demonstrates the feasibility and limitations of current models for RC in under-resourced African languages and points to few-shot adaptation as a promising future direction.

Abstract

In this paper, we create NaijaRC: a new multi-choice reading comprehension dataset for three native Nigerian languages, based on high-school reading comprehension examinations. We provide baseline results by performing cross-lingual transfer with a pre-trained encoder-only model, using the existing English RACE and Belebele training datasets. Additionally, we provide results by prompting large language models (LLMs) such as GPT-4.

Paper Structure (7 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: An example of a passage in Yorùbá, with a question and its corresponding options (A-D), where C is the correct option. We provide an expert translation in English.