Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring
Hiroaki Funayama, Yuya Asazuma, Yuichiroh Matsubayashi, Tomoya Mizumoto, Kentaro Inui
TL;DR
The paper tackles the high data cost of Automated Short Answer Scoring (SAS) by introducing a two-phase cross-prompt training strategy: pre-finetune on cross-prompt data using key phrases to learn a general scoring principle, then finetune on a target prompt with in-prompt data. The approach uses a BERT-based regression model whose inputs combine rubric key phrases and student answers, and whose score function $m(\boldsymbol{x})$ outputs a value in $[0,1]$. Experiments on the RIKEN Japanese SAS dataset show that pre-finetuning with key phrases substantially improves accuracy, especially when in-prompt data are scarce, and can halve the required labeled data. An extensive analysis demonstrates that the model captures the scoring principle and that the diversity of pre-finetuning prompts enhances generalization, though cross-prompt pre-finetuning without key phrases is ineffective. Overall, the method offers a data-efficient path for deploying SAS across many prompts, with publicly released code and settings to enable reproducibility.
Abstract
Automated Short Answer Scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. Although SAS is useful in real-world applications, both rubrics and reference answers differ between prompts, so new data must be acquired and a model trained for each new prompt. Such requirements are costly, especially for schools and online courses where resources are limited and only a few prompts are used. In this work, we attempt to reduce this cost through a two-phase approach: first train a model on existing rubrics and answers with gold score signals, then finetune it on a new prompt. Specifically, given that scoring rubrics and reference answers differ for each prompt, we utilize key phrases, i.e., representative expressions that an answer should contain to earn a higher score, and train a SAS model to learn the relationship between key phrases and answers using already annotated prompts (i.e., cross-prompts). Our experimental results show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy, especially when the training data is limited. Finally, our extensive analysis shows that it is crucial to design the model so that it can learn the task's general property.
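As a rough illustration of the input format and score mapping summarized above, the following minimal Python sketch joins key phrases with a student answer and rescales scores to and from $[0,1]$. It is not the authors' released code: the function names, the `[SEP]` separator, the space-joined key phrases, and the round-half-up rule are all assumptions made for illustration.

```python
from typing import List

SEP = "[SEP]"  # assumed BERT-style separator token (hypothetical choice)


def build_input(key_phrases: List[str], answer: str) -> str:
    """Concatenate rubric key phrases and a student answer into one sequence.

    Assumption: key phrases are space-joined and separated from the answer
    by a single separator token.
    """
    return " ".join(key_phrases) + f" {SEP} " + answer


def normalize_score(score: int, max_score: int) -> float:
    """Map a gold score to [0, 1] so prompts with different scales are comparable."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    return score / max_score


def denormalize_score(m_x: float, max_score: int) -> int:
    """Map a model output m(x) in [0, 1] back to an integer score for the prompt."""
    clipped = min(max(m_x, 0.0), 1.0)      # guard against out-of-range outputs
    return int(clipped * max_score + 0.5)  # round half up (assumed rule)
```

Normalizing by each prompt's maximum score is what lets one regression target be shared across prompts during pre-finetuning; at inference time the output is rescaled with the target prompt's own maximum.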
