Do LLMs Make Mistakes Like Students? Exploring Natural Alignment between Language Models and Human Error Patterns

Naiming Liu, Shashank Sonkar, Richard G. Baraniuk

TL;DR

This paper investigates whether Large Language Models (LLMs) align with human student error patterns in multiple-choice questions (MCQs). It introduces a dual analysis framework that (i) compares LLM generation likelihoods to actual student distractor selections and (ii) compares LLM mistakes to the most common student misconceptions, using a dataset of $3{,}202$ MCQs across six domains and two model families (LLaMA and Qwen) spanning $0.5\text{B}$ to $72\text{B}$ parameters. Key findings show moderate correlations between LLM-generated probabilities and student distractor patterns ($r \in [0.28, 0.37]$ for index-based prompting, increasing with model size and instruction tuning), and a robust tendency for LLMs to select the most common student distractor when they make errors (up to $59\%$ alignment in large models and about $51\%$ in the smallest). The results imply that while LLMs do not fully replicate human reasoning, smaller models can efficiently generate pedagogically relevant distractors by mirroring common misconceptions, offering a cost-effective supplement to human-designed distractors for educational assessment. The work suggests hybrid approaches that combine LLM-based distractor generation with human curation to enhance MCQ diagnostics and tutoring systems.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various educational tasks, yet their alignment with human learning patterns, particularly in predicting which incorrect options students are most likely to select in multiple-choice questions (MCQs), remains underexplored. Our work investigates the relationship between LLM generation likelihood and student response distributions in MCQs, with a specific focus on distractor selections. We collect a comprehensive dataset of MCQs with real-world student response distributions to explore two fundamental research questions: (1) RQ1: Do the distractors that students select more frequently correspond to those to which LLMs assign higher generation likelihood? (2) RQ2: When an LLM selects an incorrect choice, does it choose the same distractor that most students pick? Our experiments reveal moderate correlations between LLM-assigned probabilities and student selection patterns for distractors in MCQs. Additionally, when LLMs make mistakes, they are more likely to select the same incorrect answers that commonly mislead students, a pattern consistent across both small and large language models. Our work provides empirical evidence that, despite LLMs' strong performance on generating educational content, there remains a gap between LLMs' underlying reasoning processes and human cognitive processes in identifying confusing distractors. Our findings also have significant implications for educational assessment development. Smaller language models could be used efficiently for automated distractor generation, as they show patterns in identifying confusing answer choices similar to those of larger language models. This observed alignment between LLMs and student misconception patterns opens new opportunities for generating high-quality distractors that complement traditional human-designed distractors.
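To make RQ2 concrete, the following is a minimal sketch of how an alignment rate of this kind could be computed: among the questions an LLM answers incorrectly, it counts how often the LLM's choice is the distractor that most students selected. The record layout (`correct`, `student_freqs`, `llm_choice`), the function names, and the toy numbers are illustrative assumptions and are not taken from the paper; the paper's exact procedure (for example, its handling of ties) may differ.

```python
from typing import Dict, List


def most_common_distractor(student_freqs: List[float], correct_idx: int) -> int:
    """Return the index of the incorrect option that students selected most often.

    Ties are broken in favour of the lowest index (an assumption of this sketch).
    """
    distractors = [i for i in range(len(student_freqs)) if i != correct_idx]
    return max(distractors, key=lambda i: student_freqs[i])


def alignment_rate(items: List[Dict]) -> float:
    """Fraction of LLM errors that land on the students' most-selected distractor."""
    errors = 0
    aligned = 0
    for item in items:
        if item["llm_choice"] == item["correct"]:
            continue  # RQ2 only considers questions the LLM gets wrong
        errors += 1
        top = most_common_distractor(item["student_freqs"], item["correct"])
        if item["llm_choice"] == top:
            aligned += 1
    return aligned / errors if errors else float("nan")


# Hypothetical toy data: four 4-option questions (options indexed 0..3).
items = [
    {"correct": 0, "student_freqs": [0.50, 0.30, 0.15, 0.05], "llm_choice": 1},  # error, aligned
    {"correct": 2, "student_freqs": [0.10, 0.20, 0.60, 0.10], "llm_choice": 2},  # correct, skipped
    {"correct": 3, "student_freqs": [0.25, 0.10, 0.05, 0.60], "llm_choice": 1},  # error, not aligned
    {"correct": 1, "student_freqs": [0.05, 0.70, 0.05, 0.20], "llm_choice": 3},  # error, aligned
]

rate = alignment_rate(items)
print(f"alignment rate over LLM errors: {rate:.3f}")
```

With three distractors per question, a model that erred uniformly at random would hit the most common distractor about one third of the time, which is a useful baseline when reading an alignment rate computed this way.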

Paper Structure

This paper contains 27 sections, 8 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Pearson correlation between LLM generation probabilities and student selection frequencies for incorrect answer choices (distractors) across model sizes. The index-based approach (left) measures correlation for A/B/C/D label selection probabilities, while the text-based approach (right) measures correlation for full distractor text generation probabilities. Results for base and instruction-tuned variants of the LLaMA and Qwen model families show relatively stronger alignment between LLMs and student distractor selection patterns as model size increases, especially for the text-based approach.
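
The quantity plotted in Figure 1 can be illustrated with a small sketch. It assumes that, for each question, the correct option is dropped and the remaining (LLM probability, student frequency) pairs are pooled across questions before a single Pearson coefficient is computed. The pooling choice, the function names, and the toy numbers are assumptions for illustration only; the paper may aggregate or normalise differently.

```python
import math
from typing import List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def distractor_pairs(
    llm_probs: List[float], student_freqs: List[float], correct_idx: int
) -> List[Tuple[float, float]]:
    """Pair LLM probability with student frequency for every incorrect option."""
    return [
        (p, f)
        for i, (p, f) in enumerate(zip(llm_probs, student_freqs))
        if i != correct_idx
    ]


# Hypothetical toy data: (LLM option probabilities, student frequencies, correct index).
questions = [
    ([0.60, 0.25, 0.10, 0.05], [0.50, 0.30, 0.15, 0.05], 0),
    ([0.20, 0.10, 0.60, 0.10], [0.10, 0.20, 0.60, 0.10], 2),
]

pooled = [
    pair
    for llm_probs, student_freqs, correct in questions
    for pair in distractor_pairs(llm_probs, student_freqs, correct)
]
llm_values = [p for p, _ in pooled]
student_values = [f for _, f in pooled]

r = pearson(llm_values, student_values)
print(f"Pearson r over pooled distractors: {r:.3f}")
```

In the index-based setting the LLM probabilities would come from the A/B/C/D label tokens, and in the text-based setting from the likelihood of the full distractor text; the sketch is agnostic to which of the two produced `llm_probs`.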