Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan

Abstract

Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.

Paper Structure

This paper contains 54 sections, 5 figures, 19 tables.

Figures (5)

  • Figure 1: An LLM reasoning trace for distractor generation, annotated according to our taxonomy (Table \ref{tab:taxonomy-definition}). Color-coded labels show the model interpreting the task, anchoring in the correct solution, generating candidate distractors based on errors, evaluating their plausibility, and curating the final set, mirroring strategies identified in the learning-science literature (Section \ref{sec:ls-foundations}).
  • Figure 2: How often each strategy in our taxonomy (Table \ref{tab:taxonomy-definition}) was annotated at different stages of DeepSeek-V3.2's reasoning traces. Time is normalized (0 = start of trace, 1 = end). Proportions sum to one within each of the five temporal bins.
  • Figure 3: Transition probabilities between strategies in DeepSeek-V3.2's traces, shown for sequences of up to four strategies (left to right). Node height represents strategy share, and link width indicates the transition probability between successive strategies. For simplicity, only dominant transitions (>15% of outgoing mass) are shown.
  • Figure 4: How often each strategy in our taxonomy (Table \ref{tab:taxonomy-definition}) was annotated at different stages of GLM-4.7's reasoning traces. Time is normalized (0 = start of trace, 1 = end). Proportions sum to one within each of the five temporal bins.
  • Figure 5: Transition probabilities between strategies in GLM-4.7's traces, shown for sequences of up to four strategies (left to right). Node height represents strategy share, and link width indicates the transition probability between successive strategies. For simplicity, only dominant transitions (>15% of outgoing mass) are shown.
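
The quantities described in the captions of Figures 2 to 5 can be illustrated with a minimal sketch. It is not the authors' code. It assumes a trace is available as a list of (normalized time, strategy label) pairs, and the function names and the labels used in the usage note are hypothetical. The transition function covers only pairwise (first-order) transitions, which is a simplification of the length-4 sequences shown in the figures.

```python
from collections import Counter, defaultdict


def binned_strategy_shares(trace, n_bins=5):
    """Share of each strategy per temporal bin.

    trace: list of (t, label) pairs with t in [0, 1] (assumed input format).
    Returns one dict per bin; the shares in a non-empty bin sum to one.
    """
    counts = [Counter() for _ in range(n_bins)]
    for t, label in trace:
        # t == 1.0 would index past the end, so clamp it into the last bin.
        idx = min(int(t * n_bins), n_bins - 1)
        counts[idx][label] += 1
    shares = []
    for c in counts:
        total = sum(c.values())
        shares.append({k: v / total for k, v in c.items()} if total else {})
    return shares


def dominant_transitions(sequence, threshold=0.15):
    """Pairwise transition probabilities above a threshold.

    sequence: ordered strategy labels from one trace (assumed input format).
    Keeps only transitions whose share of the source strategy's outgoing
    mass is strictly greater than `threshold`.
    """
    outgoing = defaultdict(Counter)
    for src, dst in zip(sequence, sequence[1:]):
        outgoing[src][dst] += 1
    kept = {}
    for src, dsts in outgoing.items():
        total = sum(dsts.values())
        for dst, n in dsts.items():
            p = n / total
            if p > threshold:
                kept[(src, dst)] = p
    return kept
```

For example, calling `binned_strategy_shares([(0.0, "interpret"), (0.1, "solve"), (1.0, "curate")])` places the first two annotations in the first bin and the last one in the fifth bin, leaving the middle bins empty. Normalizing within each bin, rather than over the whole trace, is what makes the per-bin proportions sum to one, as the captions note.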