How Can I Get It Right? Using GPT to Rephrase Incorrect Trainee Responses

Jionghao Lin, Zifei Han, Danielle R. Thomas, Ashish Gurung, Shivang Gupta, Vincent Aleven, Kenneth R. Koedinger

TL;DR

This paper tackles the problem of scaling tutor training by using GPT-4 to automatically identify incorrect trainee responses and rephrase them into the desired form across three scenario-based lessons. The authors show that few-shot prompting yields strong classification performance (average $F_{1}\approx 0.84$ and $AUC\approx 0.85$) and that GPT-4 can produce rephrasings whose accuracy often matches or exceeds that of human experts, enabling real-time explanatory feedback within a template-based system. The work makes two contributions: a binary classifier for trainee responses and a rephrasing module that turns incorrect responses into correct ones, both evaluated against human annotations and expert rephrasings. This approach offers a scalable way to improve novice tutor training and could be integrated into synchronous tutoring platforms. Future work includes covering more lessons, exploring advanced prompting strategies, and adding human-in-the-loop quality control.
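The two components described above (a few-shot classifier and a template-based feedback step) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the two labeled examples, the feedback template text, and the function names `build_classification_prompt` and `render_feedback` are all hypothetical. No model call is made; the sketch only shows how the prompt and the feedback message might be assembled.

```python
from typing import List, Optional, Tuple

# Hypothetical labeled examples, invented for illustration only.
# They follow the idea that effort-based praise is desired and
# outcome-based praise is not.
FEW_SHOT_EXAMPLES: List[Tuple[str, str]] = [
    ("Great job getting the answer right!", "incorrect"),
    ("I can see how much effort you put into this problem.", "correct"),
]


def build_classification_prompt(
    examples: List[Tuple[str, str]], response: str
) -> str:
    """Assemble a few-shot prompt: labeled examples first, then the new response."""
    lines = ["Classify the trainee's response as correct or incorrect.", ""]
    for text, label in examples:
        lines.append(f"Response: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the model to complete.
    lines.append(f"Response: {response}")
    lines.append("Label:")
    return "\n".join(lines)


def render_feedback(label: str, rephrased: Optional[str] = None) -> str:
    """Fill a hypothetical feedback template from the label and a rephrasing."""
    if label == "correct":
        return "Your response is on the right track."
    if rephrased is None:
        raise ValueError("An incorrect response needs a rephrased version.")
    return f'Your response could be improved. Consider saying: "{rephrased}"'


prompt = build_classification_prompt(FEW_SHOT_EXAMPLES, "You are so smart!")
feedback = render_feedback("incorrect", "You kept trying even when it got hard.")
```

In a full system, the string in `prompt` would be sent to the language model, and the returned label and rephrasing would be passed to `render_feedback`.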

Abstract

One-on-one tutoring is widely acknowledged as an effective instructional method, provided that tutors are qualified. However, the high demand for qualified tutors remains a challenge, often necessitating the training of novice tutors (i.e., trainees) to ensure effective tutoring. Research suggests that providing timely explanatory feedback can facilitate the training process for trainees. Providing such feedback is difficult, however, because assessing trainee performance is time-consuming for human experts. Inspired by recent advances in large language models (LLMs), our study employed the GPT-4 model to build an explanatory feedback system. This system classifies trainees' responses in binary form (i.e., correct/incorrect) and automatically provides template-based feedback with responses appropriately rephrased by the GPT-4 model. We conducted our study on 410 responses from trainees across three training lessons: Giving Effective Praise, Reacting to Errors, and Determining What Students Know. Our findings indicate that: 1) using a few-shot approach, the GPT-4 model effectively identifies correct and incorrect trainee responses from the three training lessons, with an average F1 score of 0.84 and an AUC score of 0.85; and 2) using the few-shot approach, the GPT-4 model adeptly rephrases incorrect trainee responses into desired responses, achieving performance comparable to that of human experts.
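For readers unfamiliar with the two reported metrics, the sketch below shows how F1 and AUC can be computed for a binary correct/incorrect classifier. It is illustrative only: the labels are made up, and it assumes the classifier emits hard 0/1 predictions rather than scores, in which case the ROC curve has a single operating point and the AUC reduces to the mean of the true positive rate and the true negative rate. The source does not say how the authors computed AUC.

```python
from typing import Sequence, Tuple


def confusion_counts(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_from_hard_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """AUC of the one-point ROC curve produced by hard 0/1 predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    # Trapezoid area under (0,0) -> (fpr,tpr) -> (1,1).
    return (1.0 + tpr - fpr) / 2.0


# Made-up labels for illustration (1 = correct response, 0 = incorrect).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

f1 = f1_score(y_true, y_pred)
auc = auc_from_hard_labels(y_true, y_pred)
```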

Paper Structure (18 sections, 2 equations, 6 figures, 11 tables)

This paper contains 18 sections, 2 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: An example of a trainee (i.e., novice tutor) incorrectly responding to an open-ended question on how to best reply to a student by giving effective praise. In this particular example, the trainee is praising the student for getting the problem correct, which is achievement or outcomes-based praise and not based on effort.
  • Figure 2: Explanatory feedback for novice tutor responses.
  • Figure 3: Distribution of accuracy and responsiveness scores from the lesson Giving Effective Praise.
  • Figure 4: Distribution of accuracy and responsiveness scores from the lesson Reacting to Errors.
  • Figure 5: Distribution of accuracy and responsiveness scores from the lesson Determining What Students Know.
  • ...and 1 more figures