Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming

Clemencia Siro, Tunde Oluwaseyi Ajayi

TL;DR

The paper studies how brittle machine reading comprehension (MRC) models are under a test-time perturbation that replaces entities with names from low-resource regions. It introduces EntSwap, which substitutes entities of six types with names of African origin, and uses it to construct AfriSQuAD2 from SQuAD2.0. Three popular MRC models are then evaluated on both datasets using exact match (EM) and F1. Large models generalize better to novel entities but still show noticeable drops, particularly when person, organization, and location entities are renamed; the error analyses point to reliance on world knowledge and to exposure bias. The work argues for more robust MRC models and more diverse dataset representation to reduce region-based brittleness, and it suggests directions for future research on fairness and robustness in QA systems.

Abstract

Question answering (QA) models have shown compelling results on the task of Machine Reading Comprehension (MRC). Recently, these systems have been shown to outperform humans on held-out test sets of datasets such as SQuAD, but their robustness is not guaranteed. Their brittleness is exposed by a drop in performance when they are evaluated on adversarially generated examples. In this study, we explore the robustness of MRC models to entity renaming, using entities from low-resource regions such as Africa. We propose EntSwap, a method for test-time perturbation, to create a test set whose entities have been renamed. In particular, we rename entities of six types (country, person, nationality, location, organization, and city) to create AfriSQuAD2. Using the perturbed test set, we evaluate the robustness of three popular MRC models. We find that large models handle novel entities better than base models. Furthermore, our analysis indicates that the person entity type is especially challenging for MRC models.
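
To make the evaluation setup concrete, the following is a minimal sketch of a test-time entity swap followed by SQuAD-style EM and F1 scoring. It is not the authors' EntSwap implementation: the `SWAPS` table, the example names, and the helper `swap_entities` are hypothetical, and the sketch assumes that entity mentions have already been identified and paired with replacements. The answer normalization (lowercasing, stripping punctuation and English articles) follows the convention commonly used for SQuAD scoring.

```python
import re
import string
from collections import Counter

# Hypothetical replacement table: original surface form -> substitute name.
# These pairs are illustrative only and are not taken from the paper.
SWAPS = {"John Smith": "Wanjiru Kamau", "France": "Kenya"}


def swap_entities(text, swaps):
    """Replace every whole-word occurrence of a key with its substitute."""
    # Try longer keys first so multi-word names win over their substrings.
    keys = sorted(swaps, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: swaps[m.group(1)], text)


def normalize(answer):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable questions: both empty counts as a match.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


context = "John Smith was born in France."
perturbed_context = swap_entities(context, SWAPS)
perturbed_gold = swap_entities("France", SWAPS)
```

In a setup like this, the same table would be applied to the context, the question, and the gold answers so that the perturbed example stays answerable; a model's EM and F1 on the original and perturbed sets can then be compared directly.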

Paper Structure (16 sections, 1 figure, 6 tables)