Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT

Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger

TL;DR

This study investigates whether multiple-choice questions (MCQs) remain effective learning tools in the era of generative AI. It compares MCQ-only, open-response-only, and combined formats across six advocacy-focused tutor lessons, using a posttest-only randomized design with 234 tutors. It examines both learning outcomes and instruction time, and explores whether GPT-4o and GPT-4-turbo can autograde open-ended responses at scale. The results show no significant learning differences by condition at posttest, while tutors in the MCQ-only condition completed instruction in less time, so the efficiency gain comes from shorter practice rather than from reduced learning. The study contributes lesson log data, human annotation rubrics, and LLM prompts to promote transparency. It also finds that the GPT models score open responses well enough for low-stakes assessment, though further research is needed before broader use.

Abstract

The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease of grading, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized controlled design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. The GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.

Paper Structure

This paper contains 23 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Instructional design sequence of the lessons illustrating the three learning-by-doing conditions, then the follow-up instruction phase, and concluding with posttest.
  • Figure 2: Average posttest scores compared across learning-by-doing conditions: MCQ Only, Open-response Only, or Both. No significant differences were found in posttest scores between conditions. Error bars represent 95% confidence intervals.
  • Figure 3: Average instruction time prior to posttest compared across learning-by-doing conditions: MCQ Only, Open-response Only, or Both. Although MCQ Only took less time on average, no overall significant differences in instruction time prior to posttest were found between the conditions. Error bars represent 95% confidence intervals.
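
The captions for Figures 2 and 3 report condition means with 95% confidence intervals. As a rough illustration of what such error bars summarize, the sketch below computes a mean and a normal-approximation 95% interval per condition. It is not the authors' analysis: the function name `mean_ci`, the normal approximation, and the toy scores are all assumptions made for this example, and none of the numbers come from the study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_ci(scores, level=0.95):
    """Return (mean, lower, upper) for a list of scores.

    Assumption: uses a normal approximation to the sampling
    distribution of the mean; a t-based interval would be wider
    for small samples.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate spread")
    m = mean(scores)
    # Standard error of the mean from the sample standard deviation.
    se = stdev(scores) / sqrt(len(scores))
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return m, m - z * se, m + z * se


# Hypothetical toy posttest scores, NOT data from the study.
toy_scores = {
    "MCQ Only": [0.6, 0.8, 0.7, 0.9, 0.5],
    "Open-response Only": [0.7, 0.6, 0.8, 0.7, 0.6],
    "Both": [0.8, 0.7, 0.6, 0.9, 0.7],
}

for condition, scores in toy_scores.items():
    m, lower, upper = mean_ci(scores)
    print(f"{condition}: mean={m:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

Overlapping intervals across conditions are in line with a "no significant difference" reading, but they are not a formal test, and the study's own inference may rest on a different model.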