A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks

Elie Antoine, Frédéric Béchet, Géraldine Damnati, Philippe Langlais

TL;DR

The paper addresses a gap in reading comprehension evaluation by moving beyond aggregate metrics to linguistically grounded, fine-grained analysis. It introduces a semantic-frame–based methodology with seven complexity factors and a ROVER-driven partitioning strategy to quantify how linguistic properties affect performance across models of varying size and architecture. Validated on the French CALOR corpus and extended to English NaturalQA (via a ChatGPT-based proxy for frame labeling), the study shows that factors such as the number of Frame Elements ($f_{nb\;FEs}$) and the entropy of lexical unit distributions ($f_{entropy}$) robustly predict difficulty, and that larger models do not necessarily overcome these challenges. The approach offers a practical, transferable way to monitor progress in reading comprehension by automatically identifying semantically complex examples, and it highlights the need for linguistically aware evaluation in NLP benchmarking.

Abstract

We introduce an evaluation methodology for reading comprehension tasks based on the intuition that certain examples, by virtue of their linguistic complexity, consistently yield lower scores regardless of model size or architecture. We capitalize on semantic frame annotation to characterize this complexity, and study seven complexity factors that may account for models' difficulty. We first deploy this methodology on a carefully annotated French reading comprehension benchmark, showing that two of those complexity factors are indeed good predictors of models' failure, while others are less so. We further deploy our methodology on a well-studied English benchmark by using Chat-GPT as a proxy for semantic annotation. Our study reveals that fine-grained, linguistically motivated automatic evaluation of a reading comprehension task is not only possible, but also helps us understand models' abilities to handle specific linguistic characteristics of input examples. It also shows that current state-of-the-art models fail on some of those characteristics, which suggests that adequately handling them requires more than merely increasing model size.
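
To make the entropy-based factor ($f_{entropy}$) named in the TL;DR more concrete, here is a minimal sketch. The exact definition used in the paper is not given in this section, so the sketch assumes the factor is the Shannon entropy of the empirical distribution of lexical units that trigger a given frame. The function name `lexical_unit_entropy` and the example triggers are hypothetical and are not taken from CALOR.

```python
import math
from collections import Counter


def lexical_unit_entropy(lexical_units):
    """Shannon entropy (in bits) of the empirical distribution of lexical units.

    Assumption: `lexical_units` lists every trigger occurrence observed for
    one frame. This is an illustrative reading of f_entropy, not the paper's
    confirmed formula.
    """
    counts = Counter(lexical_units)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) summed over distinct lexical units
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical triggers for two frames (illustrative data only).
concentrated = ["attack", "attack", "attack", "attack"]
spread = ["attack", "assault", "strike", "raid"]

print(lexical_unit_entropy(concentrated))  # 0.0: a single trigger, no uncertainty
print(lexical_unit_entropy(spread))        # 2.0: four equally likely triggers
```

Under this assumption, a frame evoked by many different lexical units with similar frequencies receives a high value, while a frame almost always evoked by the same word receives a value close to zero.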

Paper Structure

This paper contains 45 sections, 1 equation, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Example of sentence annotated with two semantic frames
  • Figure 2: Example of some complexity factors considered
  • Figure 3: Performance in Hscore according to the agreement number with the ROVER systems' combination method
  • Figure 4: Performance of ROVER according to each frame sorted by Hscore measure. The number of occurrences of each frame in the corpus is given between brackets
  • Figure 5: Hscore on 4 partitions of the evaluation corpus according to combinations of complexity factors
  • ...and 9 more figures