How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?
Subhankar Maity, Aniket Deroy, Sudeshna Sarkar
TL;DR
This work investigates zero-shot question generation from NCERT textbooks using GPT-4 Turbo and assesses alignment with Bloom's Revised Taxonomy via ML classifiers and human validation, complemented by quality metrics based on item-writing flaws (IWF). It finds notable alignment at the Understanding and Remembering levels, but substantial confusion between adjacent taxonomy levels, and reveals discrepancies between human and machine quality judgments, especially at higher cognitive levels. The study demonstrates GPT-4 Turbo's potential for automated educational content creation while underscoring the need for hybrid validation and targeted improvements to taxonomy discrimination and item quality. Proposed future directions include few-shot prompting, improved taxonomy modeling, and context-aware evaluation to better meet educational standards.
Abstract
We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in a zero-shot setting. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level of Bloom's Revised Taxonomy. We find notable consistency in complexity between the questions as generated by GPT-4 Turbo and as assessed by humans, with occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across cognitive levels, indicating a need for further refinement to fully meet educational standards.
