Automated Analysis of Learning Outcomes and Exam Questions Based on Bloom's Taxonomy
Ramya Kumar, Dhruv Gulwani, Sonit Singh
TL;DR
The paper tackles automated classification of exam questions and learning outcomes into Bloom's six cognitive levels, using a small dataset of 600 labeled sentences. It compares traditional ML models, RNNs, transformer models, and zero-shot large language models under different preprocessing and synonym-based augmentation settings. The key finding is that a simple SVM with augmentation achieves the best performance (about 0.94 accuracy), while the RNN models and BERT overfit on the limited data; RoBERTa initially avoids overfitting but shows signs of it as training progresses, and the best zero-shot LLMs reach about 0.72-0.73 accuracy. Overall, the work shows that effective Bloom-level classification is feasible with careful data handling and suggests that, in data-scarce settings, simpler models with augmentation or prompt-based LLMs can be viable alternatives to heavy fine-tuning.
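To make the zero-shot setting mentioned above concrete, the following is a minimal sketch of how a sentence might be turned into a prompt and how a free-text reply might be mapped back to one of the six levels. The prompt wording and the helper names (`build_prompt`, `parse_label`, `accuracy`) are assumptions for illustration; the paper's actual prompts and scoring code are not given here, and no model API is called.

```python
# The six cognitive categories used in the paper.
BLOOM_LEVELS = [
    "Knowledge", "Comprehension", "Application",
    "Analysis", "Synthesis", "Evaluation",
]

def build_prompt(sentence):
    """Build a zero-shot prompt (hypothetical wording, not the authors' prompt)."""
    levels = ", ".join(BLOOM_LEVELS)
    return (
        "Classify the following exam question or learning outcome into exactly "
        f"one level of Bloom's Taxonomy: {levels}.\n"
        f"Sentence: {sentence}\n"
        "Answer with the level name only."
    )

def parse_label(response):
    """Return the Bloom level that appears earliest in a model reply, or None."""
    text = response.lower()
    hits = [
        (text.find(level.lower()), level)
        for level in BLOOM_LEVELS
        if level.lower() in text
    ]
    return min(hits)[1] if hits else None

def accuracy(predictions, gold):
    """Fraction of predictions that exactly match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In use, each of the labeled sentences would be passed through `build_prompt`, sent to the chosen model, and the reply mapped with `parse_label` before computing `accuracy`. Replies that name no level return `None` and therefore count as errors under this sketch.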
Abstract
This paper explores the automatic classification of exam questions and learning outcomes according to Bloom's Taxonomy. A small dataset of 600 sentences labeled with six cognitive categories (Knowledge, Comprehension, Application, Analysis, Synthesis, and Evaluation) was processed using traditional machine learning (ML) models (Naive Bayes, Logistic Regression, Support Vector Machines), recurrent neural network architectures (LSTM, BiLSTM, GRU, BiGRU), transformer-based models (BERT and RoBERTa), and large language models (OpenAI, Gemini, Ollama, Anthropic). Each model was evaluated under different preprocessing and augmentation strategies, such as synonym replacement and word embeddings. Among traditional ML approaches, Support Vector Machines (SVM) with data augmentation achieved the best overall performance, reaching 94 percent accuracy, recall, and F1 score with minimal overfitting. In contrast, the RNN models and BERT suffered from severe overfitting, while RoBERTa initially avoided overfitting but began to show signs of it as training progressed. Finally, zero-shot evaluations of large language models (LLMs) indicated that OpenAI and Gemini performed best among the tested LLMs, achieving approximately 0.72-0.73 accuracy and comparable F1 scores. These findings highlight the challenges of training complex deep models on limited data and underscore the value of careful data augmentation and simpler algorithms (such as augmented SVM) for Bloom's Taxonomy classification.
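As an illustration of the synonym-replacement augmentation described above, here is a minimal sketch using only the Python standard library. The synonym table, the replacement probability `p`, and the function names are hypothetical; the abstract does not specify the synonym source or the augmentation parameters the authors used.

```python
import random

# Hypothetical synonym table (assumption); a real pipeline would likely
# draw synonyms from a lexical resource rather than a hand-written dict.
SYNONYMS = {
    "explain": ["describe", "clarify"],
    "compare": ["contrast"],
    "design": ["devise", "construct"],
}

def augment(sentence, rng, p=0.5):
    """Return a copy of `sentence` with some words swapped for synonyms.

    Each word found in SYNONYMS is replaced with probability `p`.
    Replacements are lowercase, so original capitalization is not kept.
    """
    out = []
    for word in sentence.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)

def augment_dataset(rows, n_copies=1, seed=0):
    """Keep the original (text, label) rows and append augmented variants.

    Labels are copied unchanged, since a synonym swap is assumed not to
    change the Bloom level of the sentence.
    """
    rng = random.Random(seed)
    augmented = list(rows)
    for text, label in rows:
        for _ in range(n_copies):
            augmented.append((augment(text, rng), label))
    return augmented

rows = [
    ("Explain the role of the cell membrane", "Comprehension"),
    ("Design an experiment to test the hypothesis", "Synthesis"),
]
augmented_rows = augment_dataset(rows, n_copies=2)
```

With a dataset this small, augmented variants would normally be generated from the training split only, so that near-duplicates of a test sentence do not leak into training and inflate the reported scores.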
