Cyborg Data: Merging Human with AI Generated Training Data

Kai North, Christopher Ormerod

TL;DR

This work tackles the high cost of hand-scored data in automated essay scoring (AES) by proposing Cyborg Data, a teacher–student distillation pipeline in which a large Teacher model generates scores for unlabeled data to augment the training of a smaller Student. Students trained on augmented data containing as little as 10% hand-scored data achieve performance close to that of models trained on the full hand-scored dataset, significantly reducing annotation costs. However, the synthetic data introduces demographic biases that disadvantage certain groups, underscoring the need for calibration and bias-mitigation strategies. The study outlines a practical path toward scalable AES deployment while emphasizing fairness considerations and future improvements in calibration and bias control.

Abstract

Automated scoring (AS) systems used in large-scale assessment have traditionally relied on small statistical models that require a large quantity of hand-scored data to make accurate predictions, and collecting that data can be time-consuming and costly. Generative Large Language Models are trained on many tasks and have shown impressive abilities to generalize to new tasks with little to no data. These models require substantially more computational power to make predictions, and they still require some fine-tuning to meet operational standards. Evidence suggests, however, that they can exceed human-human levels of agreement even when fine-tuned on small amounts of data. With this in mind, we propose a model distillation pipeline in which a large generative model, a Teacher, teaches a much smaller model, a Student. The Teacher, trained on a small subset of the training data, is used to provide scores on the remaining training data, and the combined data is then used to train the Student. We call the resulting dataset "Cyborg Data", as it combines human-scored and machine-scored responses. Our findings show that Student models trained on "Cyborg Data" perform comparably to Students trained on the entire hand-scored dataset, while requiring only 10% of the original hand-scored data.
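The data-assembly step described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: `build_cyborg_data`, `fit_teacher`, and `fit_mean_teacher` are hypothetical names, and the toy Teacher simply predicts the rounded mean of the hand-scored subset in place of a fine-tuned generative model.

```python
import random
from typing import Callable, List, Sequence, Tuple

Scorer = Callable[[str], int]
Labeled = List[Tuple[str, int]]


def build_cyborg_data(
    essays: Sequence[str],
    human_scores: Sequence[int],
    fit_teacher: Callable[[Labeled], Scorer],
    human_fraction: float = 0.1,
    seed: int = 0,
) -> List[Tuple[str, int, str]]:
    """Return (essay, score, source) triples mixing human and Teacher scores.

    `fit_teacher` is a hypothetical stand-in for fine-tuning the Teacher:
    it receives only the hand-scored subset and returns a scoring function.
    """
    if len(essays) != len(human_scores):
        raise ValueError("essays and human_scores must have the same length")
    if not 0.0 < human_fraction <= 1.0:
        raise ValueError("human_fraction must be in (0, 1]")

    # Randomly choose which responses keep their human score.
    indices = list(range(len(essays)))
    random.Random(seed).shuffle(indices)
    n_human = max(1, round(human_fraction * len(essays)))
    human_idx = set(indices[:n_human])

    # The Teacher sees only the hand-scored subset.
    labeled = [(essays[i], human_scores[i]) for i in sorted(human_idx)]
    teacher = fit_teacher(labeled)

    # Remaining responses receive Teacher scores; original order is kept.
    data = []
    for i, essay in enumerate(essays):
        if i in human_idx:
            data.append((essay, human_scores[i], "human"))
        else:
            data.append((essay, teacher(essay), "teacher"))
    return data


def fit_mean_teacher(labeled: Labeled) -> Scorer:
    """Toy Teacher (assumption): always predicts the rounded mean score."""
    mean_score = round(sum(score for _, score in labeled) / len(labeled))
    return lambda essay: mean_score
```

The returned triples would then serve as the Student's training set. Keeping the `source` tag makes it possible to audit afterwards how human and Teacher scores differ.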

Paper Structure

This paper contains 11 sections, 6 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The template used to create the prompt for training the Teacher GLMs.
  • Figure 2: The average QWKs and SMDs for the ModernBERT and ELECTRA Apprentice models used for scoring when trained on original only (Orig.) and a percentage of the original train set plus remaining synthetic (w/Aug.), i.e. 10% original and 90% synthetic, 20% original and 80% synthetic and so on.
  • Figure 3: SMDs for the augmented dataset's essay scores for gender, English Language Learner (ELL) status, disability and economic status compared to their original scores provided by the PERSUADE corpus. Percentages refer to the amount of original data used within each training set; the remaining percentage is synthetic.
  • Figure 4: SMDs for the augmented dataset's essay scores for various racial/ethnic groups compared to their original scores provided by the PERSUADE corpus. Percentages refer to the amount of original data used within each training set; the remaining percentage is synthetic.
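
The figure captions above report two metrics, QWK and SMD, which this page does not define. The sketch below uses the standard quadratic weighted kappa and one common form of the standardized mean difference (machine mean minus human mean, divided by the root of the averaged sample variances). Both definitions are assumptions here and may differ in detail from what the paper computes; the function names are illustrative.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def quadratic_weighted_kappa(
    a: Sequence[int], b: Sequence[int], min_score: int, max_score: int
) -> float:
    """Quadratic weighted kappa between two integer score vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("score vectors must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score points")
    n = len(a)

    # Observed confusion matrix of (rater a, rater b) score pairs.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n  # expected count under independence
    # If both raters gave one identical constant score, agreement is perfect.
    return 1.0 if den == 0 else 1.0 - num / den


def standardized_mean_difference(
    machine: Sequence[float], human: Sequence[float]
) -> float:
    """SMD = (mean(machine) - mean(human)) / sqrt((var_m + var_h) / 2)."""
    pooled_sd = sqrt((variance(machine) + variance(human)) / 2)
    if pooled_sd == 0:
        raise ValueError("SMD is undefined when both score sets have zero variance")
    return (mean(machine) - mean(human)) / pooled_sd
```

For a subgroup analysis like that in Figures 3 and 4, `standardized_mean_difference` would be applied to the scores of one demographic group at a time. Under this sign convention, a negative value means the machine scores for that group are lower on average than the human scores.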