Improving Automated Feedback Systems for Tutor Training in Low-Resource Scenarios through Data Augmentation

Chentianye Xu; Jionghao Lin; Tongshuang Wu; Vincent Aleven; Kenneth R. Koedinger

TL;DR

This work addresses data scarcity in automated feedback for tutor training in low-resource settings by introducing a GPT-4o–driven data-augmentation pipeline to generate synthetic labeled responses, which are then used to fine-tune GPT-3.5 for sequence labeling of praise components. The approach targets accurate identification of effort-, outcome-, and person-based praise to deliver explanatory feedback, and is evaluated using $IoU$, $M\text{-}IoU$, and $F_{2}$ metrics. Results show that augmented data substantially improves model performance, with outcome-based praise benefiting most and generalization to person-based praise achieving near-ceiling performance at moderate augmentation levels. The method reduces labeling requirements while maintaining high-quality feedback, suggesting broad applicability to other educational tasks and potential integration with retrieval-augmented systems like GraphRAG for scalable, low-resource tutor training.

Abstract

Tutoring is an effective instructional method for enhancing student learning, yet its success relies on the skill and experience of the tutors. This reliance presents challenges for the widespread implementation of tutoring, particularly in training novice tutors. To support tutor training programs, real-time automated feedback systems are essential for efficiently training large numbers of tutors. A previous study by Lin et al. employed Generative Pre-Trained Transformers (GPT) for sequence labeling to identify desirable and undesirable praise components in a tutor training dataset and thereby provide explanatory feedback. However, this approach requires a significant amount of labeled data for fine-tuning, and producing that data is both labor-intensive and dependent on expert input. To reduce this labeling burden, the current study prompts more advanced GPT models such as GPT-4o to generate synthetic data that augments the labeled responses, and then fine-tunes a GPT-3.5 model on the augmented set. Our results demonstrate that the model fine-tuned with augmented data generalizes more effectively to identifying other types of praise than the same model fine-tuned without augmentation. These findings suggest that for data-intensive tasks, synthetic data generated through GPT model prompting can substantially enhance fine-tuned model performance in low-resource scenarios.
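To make the sequence-labeling setup and its evaluation concrete, the following is a minimal sketch, not the authors' implementation. It assumes whitespace-style tokens, a single praise type at a time, and an IO scheme in which tokens inside a praise span are tagged `I` and all others `O`. The function names (`io_tags`, `token_iou`, `f_beta`) and the example sentence are hypothetical. Only token-level $IoU$ and $F_{2}$ are shown; $M\text{-}IoU$ is omitted because its definition is not given in this section.

```python
from typing import List, Tuple


def io_tags(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Tag each token 'I' if it falls inside any span (end exclusive), else 'O'."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        for i in range(start, end):
            tags[i] = "I"
    return tags


def token_iou(gold: List[str], pred: List[str]) -> float:
    """Intersection over union of the token positions tagged 'I'."""
    gold_in = {i for i, t in enumerate(gold) if t == "I"}
    pred_in = {i for i, t in enumerate(pred) if t == "I"}
    union = gold_in | pred_in
    if not union:
        # Assumed convention: nothing to highlight and nothing highlighted.
        return 1.0
    return len(gold_in & pred_in) / len(union)


def f_beta(gold: List[str], pred: List[str], beta: float = 2.0) -> float:
    """Token-level F-beta; beta=2 weights recall more heavily than precision."""
    tp = sum(g == "I" and p == "I" for g, p in zip(gold, pred))
    fp = sum(g == "O" and p == "I" for g, p in zip(gold, pred))
    fn = sum(g == "I" and p == "O" for g, p in zip(gold, pred))
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # Assumed convention: perfect score when both sequences contain no 'I'.
    return (1 + b2) * tp / denom if denom else 1.0


# Hypothetical tutor response with one effort-based praise span.
tokens = "You worked hard on this , great job !".split()
gold = io_tags(len(tokens), [(0, 5)])  # "You worked hard on this"
pred = io_tags(len(tokens), [(1, 5)])  # model misses the first token

iou = token_iou(gold, pred)
f2 = f_beta(gold, pred)
print(f"IoU={iou:.3f}  F2={f2:.3f}")
```

Because $F_{2}$ weights recall more heavily than precision, a prediction that misses part of a praise span is penalized more than one that highlights a few extra tokens.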

Paper Structure (35 sections, 3 equations, 8 figures, 11 tables)

Introduction
Background
Tutoring Practice: Giving Effective Praise
Automated Feedback for Tutor Training
Sequence Labeling for Feedback Generation
Text Data Augmentation
Method
Dataset
Sequence Labeling
Fine-tuning GPTs with Augmented Data
Fine-tuning GPTs
Data Augmentation with GPT-4o
Metrics
Results
Results on RQ1
...and 20 more sections

Figures (8)

Figure 1: Labeling the praise components using the IO scheme.
Figure 2: Data augmentation process. Outcome-based praise (e.g., “Good job”) and effort-based praise (e.g., “Hard work paid off”) were diversified using GPT-4o to generate synonymous phrases, enabling the creation of varied responses to enhance model generalization.
Figure 3: Performance of the fine-tuned GPT-3.5 model on highlighting correct types of praise with different augmented training set sizes.
Figure 4: Word length distributions for outcome-based praise (top) and effort-based praise (bottom).
Figure 5: PaCMAP visualization of embedding spaces for outcome-based and effort-based praise.
...and 3 more figures
