Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models

Andreas Säuberli, Simon Clematide

TL;DR

This work investigates automatic generation and evaluation of German multiple-choice reading comprehension (MCRC) items using large language models. It introduces a text informativity metric, defined as the difference between answerability and guessability, and validates an evaluation protocol with human annotators and two LLMs (GPT-4 and Llama 2). Results show that zero-shot generation yields items of acceptable quality, with GPT-4 outperforming Llama 2 and aligning well with human judgments in automatic evaluation. The study demonstrates the practicality of LLM-based item generation and evaluation, particularly for languages with limited MCRC resources, and outlines directions for reinforcement learning and improved evaluation in future work.

Abstract

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to those of human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.

Paper Structure (35 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Our evaluation protocol measures the answerability and guessability of MCRC items by letting high-performing humans or LLMs respond to them with and without seeing the text. The text informativity metric is the difference between answerability and guessability and denotes to what degree the text informs the item responses.
  • Figure 2: Mean human and LLM response accuracies on human-written and LLM-generated items. The distance between the two points corresponds to text informativity. Accuracies are computed at the level of individual answer options, so random guessing yields 0.5. For human evaluators, means are based on 10 texts and around 185 responses without text and around 546 responses with text. For LLM evaluators, means are based on 50 texts and around 451 responses in both settings. Error bars are bootstrapped 95% confidence intervals.
  • Figure 3: Distributions of human quality ratings and their relation to human response accuracy. A rating of 1 means unusable, 5 means perfect.
  • Figure 4: Screenshot of the user interface for the human evaluation, without text.
  • Figure 5: Screenshot of the user interface for the human evaluation, with text and quality ratings.
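
The captions of Figures 1 and 2 describe text informativity as answerability (accuracy with the text) minus guessability (accuracy without the text), with bootstrapped 95% confidence intervals on the accuracies. The following Python sketch illustrates that arithmetic. The function names, the toy response data, and the choice of a percentile bootstrap over individual responses are all assumptions made for illustration; the summary above does not specify how the authors resampled.

```python
import random
from statistics import mean


def accuracy(responses):
    """Mean of binary correctness flags (1 = correct, 0 = incorrect)."""
    return mean(responses)


def text_informativity(with_text, without_text):
    """Answerability (accuracy with text) minus guessability (accuracy without text)."""
    answerability = accuracy(with_text)
    guessability = accuracy(without_text)
    return answerability - guessability


def bootstrap_ci(responses, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean accuracy.

    Assumption: responses are resampled individually with replacement;
    the original study may have resampled differently (e.g. by text).
    """
    rng = random.Random(seed)
    n = len(responses)
    means = sorted(
        mean(rng.choices(responses, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy data: one flag per answer-option response.
with_text = [1, 1, 1, 0, 1, 1, 1, 1]      # 7 of 8 correct -> 0.875
without_text = [1, 0, 1, 0, 0, 1, 0, 1]   # 4 of 8 correct -> 0.5

print(text_informativity(with_text, without_text))  # 0.875 - 0.5 = 0.375
print(bootstrap_ci(with_text))
```

Because accuracies are at the level of answer options, a guessability near 0.5 corresponds to random guessing, so a value well above 0.5 without the text would suggest that items can be answered from world knowledge alone.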