Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Danielle R. Thomas; Conrad Borchers; Sanjit Kakarla; Jionghao Lin; Shambhavi Bhushan; Boyuan Guo; Erin Gatz; Kenneth R. Koedinger

TL;DR

This study investigates whether multiple-choice questions (MCQs) remain effective learning tools in the era of generative AI. It compares MCQ-only, open-response-only, and combined formats across six advocacy-focused tutor lessons, using a posttest-only randomized design with 234 tutors. It examines both learning outcomes and instruction time, and explores whether open-ended responses can be autograded at scale with GPT-4o and GPT-4-turbo. The results show no significant learning differences across conditions, but tutors in the MCQ-only condition completed instruction in less time; most efficiency gains come from limiting task duration rather than from content loss. The study contributes lesson log data, human annotation rubrics, and LLM prompts to promote transparency. It also shows that LLMs can score open-ended responses with meaningful accuracy, although context-dependent limitations call for further study before wide-scale use.

Abstract

The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease of grading, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized control design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. The GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.
