MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation

Aniket Deroy, Subhankar Maity, Sudeshna Sarkar

TL;DR

This work tackles the challenge of evaluating open-ended questions generated by AQG systems, where traditional automated metrics fail to capture higher-order educational qualities. It introduces MIRROR, a feedback-based framework using a two-LLM loop ($LLM_1$, $LLM_2$) to iteratively rate questions on grammaticality, relevance, appropriateness, novelty, and complexity, converging to human-like judgments. Across EduProbe and SciQ datasets and multiple LLMs (GPT-4, Gemini, Llama2-70b), MIRROR improves metric scores toward human baselines and strengthens the correlation with human evaluators ($r$), especially for relevance and appropriateness. The approach demonstrates scalability and potential to replace or augment human evaluation in AQG systems, with future work aimed at longer contexts and broader domains to enhance robustness and applicability.
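A two-rater feedback loop of this general shape can be sketched as follows. This summary does not specify the prompts, the exchange format, or the stopping rule, so everything here is an assumption made for illustration: the names `feedback_loop` and `make_stub_rater`, the integer score scale, the agreement-based stopping rule, and the `max_rounds` cap are all hypothetical, and the stub raters stand in for real LLM calls.

```python
from typing import Callable, Dict, Optional, Tuple

METRICS = ("grammaticality", "relevance", "appropriateness", "novelty", "complexity")
Scores = Dict[str, int]
# Hypothetical rater interface: (context, question, peer_scores) -> one score per metric.
Rater = Callable[[str, str, Optional[Scores]], Scores]


def feedback_loop(context: str, question: str, llm_1: Rater, llm_2: Rater,
                  max_rounds: int = 5) -> Tuple[Scores, int]:
    """Alternate two raters, each seeing the other's latest scores.

    Assumed stopping rule: stop when both raters return identical scores
    or after max_rounds exchanges. Returns (final scores, rounds used).
    """
    scores_1 = llm_1(context, question, None)      # first rating, no feedback yet
    scores_2 = llm_2(context, question, scores_1)  # second rater reviews the first
    rounds = 1
    while scores_1 != scores_2 and rounds < max_rounds:
        scores_1 = llm_1(context, question, scores_2)
        scores_2 = llm_2(context, question, scores_1)
        rounds += 1
    return scores_2, rounds


def make_stub_rater(initial: int) -> Rater:
    """Stand-in for an LLM call: moves each score one step toward the peer's score."""
    state = {m: initial for m in METRICS}

    def rate(context: str, question: str, peer: Optional[Scores]) -> Scores:
        if peer is not None:
            for m in METRICS:
                # Step by +1, 0, or -1 depending on where the peer's score lies.
                state[m] += (peer[m] > state[m]) - (peer[m] < state[m])
        return dict(state)

    return rate


if __name__ == "__main__":
    final, rounds = feedback_loop("some context", "Why does X happen?",
                                  make_stub_rater(2), make_stub_rater(5))
    print(final, rounds)
```

In a real setting each `Rater` would wrap a prompt to a model such as GPT-4 or Gemini and parse the returned scores; the stub only demonstrates how iterated feedback can drive two initially different ratings toward agreement.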

Abstract

Automatic question generation (AQG) is a critical task whose evaluation involves judging question quality on factors such as engagement, pedagogical value, and the ability to stimulate critical thinking. These aspects require human-like understanding and judgment, which automated systems currently lack. However, human evaluations are costly and impractical for large-scale samples of generated questions. Therefore, we propose a novel system, MIRROR (Multi-LLM Iterative Review and Response for Optimized Rating), which leverages large language models (LLMs) to automate the evaluation of questions produced by AQG systems. We experimented with several state-of-the-art LLMs, such as GPT-4, Gemini, and Llama2-70b. We observed that scores on the human evaluation metrics, namely relevance, appropriateness, novelty, complexity, and grammaticality, improved under the feedback-based MIRROR approach, tending to be closer to the human baseline scores. Furthermore, Pearson's correlation coefficient between GPT-4 and human experts improved with MIRROR compared to direct prompting for evaluation. Error analysis shows that MIRROR significantly helps to improve relevance and appropriateness.
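For readers unfamiliar with the agreement measure mentioned above, here is a minimal sketch of Pearson's correlation coefficient between two lists of ratings. The function name `pearson_r` and the toy ratings are illustrative assumptions; none of the numbers come from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Toy ratings, invented for illustration only.
    human_ratings = [4, 3, 5, 2, 4]
    model_ratings = [4, 3, 4, 2, 5]
    print(round(pearson_r(human_ratings, model_ratings), 3))
```

A value near 1 means the model ranks and spaces questions much as the human experts do, while a value near 0 means the two sets of ratings are essentially unrelated. The coefficient is undefined when either rater gives the same score to every question, which is why the sketch raises an error in that case.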

Paper Structure

This paper contains 17 sections, 16 figures, 9 tables, 2 algorithms.

Figures (16)

  • Figure 1: A sample of <Context, Generated Question> pairs from the EduProbe and SciQ datasets.
  • Figure 2: Prompt used on GPT-3.5 Turbo to generate a question from a context.
  • Figure 3: An overview of the direct prompting approach.
  • Figure 4: Prompt used in direct approach for evaluating human evaluation metrics.
  • Figure 5: An overview of the proposed approach called MIRROR.
  • ...and 11 more figures