Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation
Nicy Scaria, Suma Dharani Chenna, Deepak Subramani
TL;DR
This study investigates automated educational question generation (AEQG) across Bloom’s taxonomy using five state-of-the-art LLMs and varied prompting strategies. By generating 2550 questions over 17 data-science topics and evaluating them with two human experts and an automated evaluator, the work demonstrates that instruction-tuned LLMs can yield diverse, high-quality questions at multiple cognitive levels, with GPT-4 and GPT-3.5 often leading performance. The results reveal substantial model- and prompt-dependent variability, show that too much prompt information can degrade performance for open-source models, and expose a clear gap between automated evaluations and human judgments. The release of the DataScienceQ dataset and the insights into prompting strategies offer practical guidance for scalable, cognitively diverse AEQG in education while outlining important avenues for improving automated evaluation methods and context-specific content generation.
Abstract
Developing questions that are pedagogically sound, relevant, and promote learning is a challenging and time-consuming task for educators. Modern-day large language models (LLMs) generate high-quality content across multiple domains, potentially helping educators to develop high-quality questions. Automated educational question generation (AEQG) is important for scaling online education that caters to a diverse student population. Past attempts at AEQG have shown limited ability to generate questions at higher cognitive levels. In this study, we examine the ability of five state-of-the-art LLMs of different sizes to generate diverse and high-quality questions at different cognitive levels, as defined by Bloom's taxonomy. We use advanced prompting techniques of varying complexity for AEQG. We conduct expert and LLM-based evaluations to assess the linguistic and pedagogical relevance and quality of the questions. Our findings suggest that LLMs can generate relevant and high-quality educational questions at different cognitive levels when prompted with adequate information, although performance varies significantly across the five LLMs considered. We also show that automated evaluation is not on par with human evaluation.
