UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions

Ana-Cristina Rogoz, Radu Tudor Ionescu

TL;DR

The paper tackles automated prediction of item difficulty and response time for retired USMLE MCQs by augmenting a small dataset with zero-shot outputs from three 7B LLMs. It evaluates transformer-based regressors across multiple feature sets that combine the question text, answers, and LLM-generated responses, finding that incorporating LLM answers and the question text improves performance, with difficulty proving more challenging than response time. A key insight is that linear probing on frozen features can yield better generalization on limited data, as demonstrated by post-competition $\nu$-SVR+BERT methods achieving strong RMSE on the official test set. Overall, LLM-driven augmentation shows promise for scalable, secure automated assessment, though future work should address overfitting and leverage larger models under greater computational resources.

Abstract

This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task. Our approach augments the dataset with answers from zero-shot LLMs (Falcon, Meditron, Mistral) and trains transformer-based models on six alternative feature combinations. The results suggest that predicting the difficulty of questions is more challenging than predicting response time. Notably, our top-performing methods consistently include the question text and benefit from the variability of LLM answers, highlighting the potential of LLMs for improving automated assessment in medical licensing exams. We make our code available at https://github.com/ana-rogoz/BEA-2024.
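
The summary above reports results as RMSE for both targets. As a reminder of what that metric measures, here is a minimal sketch of root mean squared error in plain Python. The function name `rmse` and the sample values are hypothetical and chosen only for illustration; they are not taken from the paper or its repository.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical difficulty labels and predictions (illustrative only).
true_difficulty = [0.2, 0.5, 0.9]
pred_difficulty = [0.3, 0.5, 0.7]
difficulty_rmse = rmse(true_difficulty, pred_difficulty)
print(f"difficulty RMSE: {difficulty_rmse:.4f}")
```

Because the errors are squared before averaging, a few badly mispredicted items weigh more heavily than many small misses, which matters on a small dataset such as this one.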

Paper Structure (15 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: An overview of the data preprocessing and model training workflow for predicting item difficulty and response time of medical exam questions. The initial dataset is enriched with zero-shot prompted responses generated by Large Language Models (LLMs). We then preprocess the augmented dataset by scaling the target labels, adding new feature combinations, cleaning the text, and establishing the split for cross-validation. Finally, two alternative transformer-based models are fine-tuned on the augmented data.
  • Figure 2: Left: Correlation between the EXAM integer feature and the difficulty label. Right: Correlation between the EXAM integer feature and the response time label.
  • Figure 3: Left: Correlation between the ItemType integer feature and the difficulty label. Right: Correlation between the ItemType integer feature and the response time label.
  • Figure 4: Left: Correlation between the AnswerKey integer feature and the difficulty label. Right: Correlation between the AnswerKey integer feature and the response time label.
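
Figures 2 to 4 relate integer-encoded categorical features (EXAM, ItemType, AnswerKey) to the two labels. The captions do not say which correlation measure is used, so the sketch below assumes Pearson correlation and an integer encoding by order of first appearance. Both choices, the helper names, and the sample data are assumptions made for illustration, not the authors' procedure.

```python
import math


def encode_as_integers(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    encoded = [mapping.setdefault(v, len(mapping)) for v in values]
    return encoded, mapping


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical answer keys and difficulty labels (illustrative only).
answer_keys = ["A", "B", "C", "A", "D"]
difficulty = [0.31, 0.52, 0.47, 0.28, 0.66]

encoded_keys, key_mapping = encode_as_integers(answer_keys)
correlation = pearson(encoded_keys, difficulty)
print(key_mapping, round(correlation, 3))
```

Note that the integer codes of a nominal feature such as AnswerKey carry no natural order, so a correlation computed this way depends on the encoding and is best read as a rough diagnostic.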