How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes

Sabina Elkins, Ekaterina Kochmar, Jackie C. K. Cheung, Iulian Serban

TL;DR

Educational question generation (EQG) using large language models (LLMs) can reduce teacher workload while preserving quiz quality. This paper uses GPT-3.5 with two prompting strategies (simple prompts, and controlled prompts driven by Bloom's taxonomy) to generate questions from context passages. It evaluates them through teacher-centered experiments that compare handwritten, simple, and controlled quizzes. Results show that teachers prefer writing quizzes with generated questions and that quiz quality remains comparable to handwritten versions, with Bloom-aligned control sometimes improving usefulness and context coverage. The findings support scalable, pedagogically aligned EQG deployment in classrooms, while highlighting the need to extend evaluations to more domains and to student outcomes.
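The difference between the two prompting strategies can be illustrated with a short Python sketch. The prompt wording, the function names `simple_prompt`, `controlled_prompt` and `controlled_prompts`, and the use of the six levels of the revised Bloom's taxonomy are all hypothetical assumptions; the paper's actual prompts (Figures 1 and 2) are not reproduced here, and the sketch only builds prompt strings without calling GPT-3.5.

```python
from typing import List

# Assumption: the six levels of the revised Bloom's taxonomy, lowest to highest.
# The paper may use a different variant or different wording for the levels.
BLOOM_LEVELS: List[str] = [
    "remember", "understand", "apply", "analyze", "evaluate", "create",
]


def simple_prompt(context: str) -> str:
    """Hypothetical 'simple' prompt: ask for any question about the passage."""
    return (
        "Read the following passage and write one quiz question about it.\n\n"
        f"Passage:\n{context}\n\nQuestion:"
    )


def controlled_prompt(context: str, level: str) -> str:
    """Hypothetical 'controlled' prompt: condition the question on one Bloom level."""
    level = level.lower()
    if level not in BLOOM_LEVELS:
        raise ValueError(f"unknown Bloom level: {level!r}")
    return (
        "Read the following passage and write one quiz question that requires "
        f"students to {level} the material (Bloom's taxonomy level: {level}).\n\n"
        f"Passage:\n{context}\n\nQuestion:"
    )


def controlled_prompts(context: str) -> List[str]:
    """Build one controlled prompt per Bloom level for the same passage."""
    return [controlled_prompt(context, lvl) for lvl in BLOOM_LEVELS]
```

Sending each controlled prompt to the model separately is one way to obtain questions that span several cognitive levels for a single passage, whereas the simple prompt leaves the choice of level entirely to the model.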

Abstract

Question generation (QG) is a natural language processing task with an abundance of potential benefits and use cases in the educational domain. For this potential to be realized, QG systems must be designed and validated with pedagogical needs in mind. However, little research has assessed or designed QG approaches with input from real teachers or students. This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy. The automatically generated questions are used in multiple experiments designed to assess how teachers use them in practice. The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions. Further, several metrics indicate that automatically generated questions can even improve the quality of the quizzes created, showing promise for large-scale use of QG in the classroom setting.

Paper Structure (15 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Controlled prompting strategy.
  • Figure 2: Simple prompting strategy.
  • Figure 3: Quiz writing experiment diagram depicting the three quiz writing settings each teacher completed.
  • Figure 4: Context coverage by cohort and quiz type. * represents a significant difference at the α = 0.05 level, and *** represents a significant difference at the α = 0.001 level. The error bars represent 95% confidence intervals.
  • Figure 5: Quiz-level quality metrics. The error bars represent 95% confidence intervals. * represents a significant difference at the α = 0.05 level.
  • Figures 6 to 8: captions not listed here.
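Figures 4 and 5 report 95% confidence intervals as error bars around per-quiz metrics. The following Python sketch shows one common way to compute such an interval for a sample mean. It assumes a normal approximation (mean ± z · standard error); the paper's exact procedure is not stated in this summary and may instead use a t-based or bootstrap interval, which would be wider for small teacher cohorts. The function name `mean_ci` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def mean_ci(values: Sequence[float], confidence: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation confidence interval.

    Assumption: normal approximation; small samples may warrant a t-based interval.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a standard error")
    m = mean(values)
    se = stdev(values) / sqrt(len(values))  # sample standard error of the mean
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return m, m - z * se, m + z * se
```

For example, `mean_ci([0.6, 0.7, 0.8])` on illustrative coverage scores (not data from the paper) returns a mean of 0.7 with an interval of roughly 0.59 to 0.81.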