How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?

Subhankar Maity, Aniket Deroy, Sudeshna Sarkar

TL;DR

This work investigates zero-shot question generation from NCERT textbooks using GPT-4 Turbo and assesses alignment with Bloom's Revised Taxonomy via ML classifiers and human validation, complemented by IWF-based quality metrics. It finds notable alignment at Understanding and Remembering levels, but substantial confusion between adjacent taxonomy levels, and reveals discrepancies between human and machine quality judgments, especially at higher cognitive levels. The study demonstrates GPT-4 Turbo's potential for automated educational content creation while underscoring the need for hybrid validation and targeted improvements to taxonomy discrimination and item quality. The proposed future directions include few-shot prompting, improved taxonomy modeling, and context-aware evaluation to better meet educational standards.

Abstract

We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.
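The comparison described above, between the taxonomy level a question was generated for and the level a human or classifier assigns to it, can be illustrated with a simple agreement calculation. The following is a minimal sketch, not the authors' evaluation code: the function name `agreement_report`, the choice of Cohen's kappa as the agreement statistic, and the sample labels are all assumptions made for illustration.

```python
from collections import Counter

# The six levels of Bloom's Revised Taxonomy, ordered from lowest to highest.
BLOOM_LEVELS = [
    "Remembering", "Understanding", "Applying",
    "Analyzing", "Evaluating", "Creating",
]


def agreement_report(intended, judged):
    """Compare intended taxonomy levels with levels assigned by a rater.

    Hypothetical helper: returns raw agreement, Cohen's kappa, and the
    number of disagreements that fall on adjacent taxonomy levels.
    """
    if len(intended) != len(judged):
        raise ValueError("label lists must have the same length")
    n = len(intended)
    if n == 0:
        raise ValueError("label lists must not be empty")

    # Observed agreement: fraction of questions where both labels match.
    observed = sum(a == b for a, b in zip(intended, judged)) / n

    # Chance agreement, from each side's marginal label frequencies.
    intended_counts = Counter(intended)
    judged_counts = Counter(judged)
    labels = set(intended_counts) | set(judged_counts)
    expected = sum(intended_counts[l] * judged_counts[l] for l in labels) / (n * n)

    # Cohen's kappa; defined as 1.0 when chance agreement is already total.
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)

    # Disagreements where the two labels are exactly one level apart.
    rank = {level: i for i, level in enumerate(BLOOM_LEVELS)}
    adjacent = sum(abs(rank[a] - rank[b]) == 1 for a, b in zip(intended, judged))

    return {"observed": observed, "kappa": kappa, "adjacent_confusions": adjacent}


# Illustrative labels only; these are not data from the paper.
intended = ["Remembering", "Understanding", "Applying", "Analyzing"]
judged = ["Remembering", "Understanding", "Analyzing", "Analyzing"]
report = agreement_report(intended, judged)
print(report)
```

Counting adjacent-level disagreements separately is one way to check whether mismatches cluster between neighbouring levels, which is the pattern the TL;DR describes.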

Paper Structure (11 sections, 3 figures)

Figures (3)

  • Figure 1: Prompt template for generating questions following Bloom’s revised taxonomy in the zero-shot setting.
  • Figure 2: Alignment of Bloom’s Revised Taxonomy levels (a) between GPT-4-Taxonomy and ML-Taxonomy (150 samples) and (b) between GPT-4-Taxonomy and Human-Taxonomy (60 samples).
  • Figure 3: Quality evaluation outcomes for a sample of 60 questions generated by GPT-4 Turbo (i.e., GPT-4-Taxonomy), as assessed by (a) a human teacher (i.e., Human-Validation) and (b) an ML model according to the IWF criteria (i.e., Machine-Validation), together with the agreement between the two validation approaches.