How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?

Subhankar Maity; Aniket Deroy; Sudeshna Sarkar

TL;DR

This work investigates zero-shot question generation from NCERT textbooks using GPT-4 Turbo and assesses alignment with Bloom's Revised Taxonomy via ML classifiers and human validation, complemented by IWF-based quality metrics. It finds notable alignment at Understanding and Remembering levels, but substantial confusion between adjacent taxonomy levels, and reveals discrepancies between human and machine quality judgments, especially at higher cognitive levels. The study demonstrates GPT-4 Turbo's potential for automated educational content creation while underscoring the need for hybrid validation and targeted improvements to taxonomy discrimination and item quality. The proposed future directions include few-shot prompting, improved taxonomy modeling, and context-aware evaluation to better meet educational standards.

Abstract

We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. The complexity levels of the questions generated by GPT-4 Turbo are notably consistent with the levels assigned by human assessors, although occasional differences occur. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across cognitive levels, indicating a need for further refinement to fully meet educational standards.
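To make the notion of taxonomy alignment concrete, the following minimal sketch compares the level a question was generated for with the level a second rater (human or classifier) assigns to it. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `alignment_summary`, the gerund level names, the choice of metrics, and the toy labels are all hypothetical.

```python
from collections import Counter

# Bloom's Revised Taxonomy levels, ordered from lowest to highest.
# The gerund spelling is an assumption; adapt it to the labels in your data.
LEVELS = ["Remembering", "Understanding", "Applying",
          "Analyzing", "Evaluating", "Creating"]


def alignment_summary(intended, judged):
    """Compare intended taxonomy levels with levels assigned by a second rater.

    Returns the overall agreement rate, the share of disagreements that fall
    on a neighbouring level, and the agreement rate per intended level.
    """
    if len(intended) != len(judged):
        raise ValueError("label lists must have the same length")
    if not intended:
        raise ValueError("label lists must not be empty")

    rank = {level: i for i, level in enumerate(LEVELS)}
    totals = Counter(intended)   # questions per intended level
    exact = Counter()            # exact matches per intended level
    adjacent = 0                 # mismatches that are one level apart

    for a, b in zip(intended, judged):
        if a == b:
            exact[a] += 1
        elif abs(rank[a] - rank[b]) == 1:
            adjacent += 1

    n = len(intended)
    return {
        "overall_agreement": sum(exact.values()) / n,
        "adjacent_confusion_rate": adjacent / n,
        "per_level_agreement": {
            lvl: exact[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]
        },
    }


# Toy labels, made up for illustration only (not data from the paper).
intended = ["Remembering", "Understanding", "Applying", "Analyzing"]
judged = ["Remembering", "Understanding", "Analyzing", "Analyzing"]
summary = alignment_summary(intended, judged)
print(summary)
```

Reporting adjacent-level confusion separately from overall agreement is one possible design choice: it distinguishes near misses between neighbouring levels from larger mismatches.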

Paper Structure (11 sections, 3 figures)

Introduction
Related Work
Dataset
Methodology
Evaluation
Evaluation of Question Quality
Strategy for Evaluating Bloom's Revised Taxonomy
Human Evaluation
Results
Analysis
Conclusion and Future Work

Figures (3)

Figure 1: Prompt template for generating questions following Bloom’s revised taxonomy in the zero-shot setting.
Figure 2: The level of alignment in Bloom’s Revised Taxonomy (a) between the GPT-4-Taxonomy and ML-Taxonomy (w/ 150 samples) and (b) between the GPT-4-Taxonomy and Human-Taxonomy (w/ 60 samples).
Figure 3: Quality evaluation outcomes for a sample of 60 questions generated by GPT-4 Turbo (i.e., GPT-4-Taxonomy), as assessed by (a) a human teacher (i.e., Human-Validation) and (b) an ML model according to the IWF criteria (i.e., Machine-Validation), together with the agreement between the two validation approaches.
