Reducing the Cost: Cross-Prompt Pre-Finetuning for Short Answer Scoring

Hiroaki Funayama, Yuya Asazuma, Yuichiroh Matsubayashi, Tomoya Mizumoto, Kentaro Inui

TL;DR

The paper tackles the high data cost of Automated Short Answer Scoring (SAS) by introducing a two-phase cross-prompt training strategy: pre-finetune on cross-prompt data using key phrases to learn a general scoring principle, then finetune on a target prompt with in-prompt data. The approach uses a BERT-based regression model where inputs combine rubric key phrases and student answers, and a score function $m(\boldsymbol{x})$ mapped to $[0,1]$. Experiments on the RIKEN Japanese SAS dataset show that pre-finetuning with key phrases substantially improves accuracy, especially when in-prompt data are scarce, and can halve the required labeled data. An extensive analysis demonstrates that the model captures the scoring principle and that the diversity of pre-finetuning prompts enhances generalization, though mere cross-prompt pretraining without key phrases is ineffective. Overall, the method offers a data-efficient path for deploying SAS across many prompts, with publicly released code and settings to enable reproducibility.

Abstract

Automated Short Answer Scoring (SAS) is the task of automatically scoring a given input to a prompt based on rubrics and reference answers. Although SAS is useful in real-world applications, both rubrics and reference answers differ between prompts, so new data must be acquired and a model trained for each new prompt. Such requirements are costly, especially for schools and online courses where resources are limited and only a few prompts are used. In this work, we attempt to reduce this cost through a two-phase approach: train a model on existing rubrics and answers with gold score signals and finetune it on a new prompt. Specifically, given that scoring rubrics and reference answers differ for each prompt, we utilize key phrases, or representative expressions that the answer should contain to increase scores, and train a SAS model to learn the relationship between key phrases and answers using already annotated prompts (i.e., cross-prompts). Our experimental results show that finetuning on existing cross-prompt data with key phrases significantly improves scoring accuracy, especially when the training data is limited. Finally, our extensive analysis shows that it is crucial to design the model so that it can learn the task's general property.
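
As a rough illustration of the input format summarized above (key phrases and a student answer separated by a [SEP] token, with scores mapped to [0,1]), the following minimal Python sketch may help. It is not the authors' released code. The function names, the space-joined key phrases, and the division by the maximum score are all assumptions made for illustration; in practice a BERT tokenizer would usually insert the special tokens itself when given a text pair.

```python
def build_input(key_phrases, answer, cls_token="[CLS]", sep_token="[SEP]"):
    """Join rubric key phrases and a student answer into one sequence.

    Assumption: key phrases are simply concatenated with spaces; the
    actual preprocessing in the paper may differ.
    """
    key_text = " ".join(p.strip() for p in key_phrases if p.strip())
    return f"{cls_token} {key_text} {sep_token} {answer.strip()} {sep_token}"


def normalize_score(score, max_score):
    """Map a gold score in [0, max_score] to a regression target in [0, 1]."""
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    if not 0 <= score <= max_score:
        raise ValueError("score must lie in [0, max_score]")
    return score / max_score


def denormalize_score(prediction, max_score):
    """Map a model output back to an integer score on the prompt's scale."""
    clipped = min(1.0, max(0.0, prediction))  # keep the prediction inside [0, 1]
    return int(clipped * max_score + 0.5)     # round half up to the nearest integer


if __name__ == "__main__":
    text = build_input(["photosynthesis", "sunlight"], "Plants use sunlight.")
    print(text)
    print(normalize_score(3, 4), denormalize_score(0.75, 4))
```

Normalizing by each prompt's maximum score is one simple way to let a single regression head serve prompts with different score ranges, which is what makes training across prompts possible in a sketch like this.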

Paper Structure (17 sections, 6 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of our proposed method. We input key phrases (reference expressions) together with an answer. We first pre-finetune the SAS model on already annotated prompts and then finetune the model on the prompt to be graded.
  • Figure 2: Example of a prompt, scoring rubric, key phrase, and student answers excerpted from the RIKEN dataset (Mizumoto et al., 2019) and translated from Japanese to English. For space reasons, some parts of the rubric and key phrase are omitted.
  • Figure 3: Overall architecture of our model. We input key phrases and a student answer split by the [SEP] token.
  • Figure 4: Quadratic weighted kappa (QWK) and standard deviation for the four experimental settings: Baseline, Key phrase, Pre-finetune, and Pre-finetune & key phrase. In the pre-finetuning phase, we use 88 prompts with 480 answers per prompt. We vary the number of finetuning instances over 10, 25, 50, 100, and 200.
  • Figure 5: QWK and standard deviation when the total number of answers used for pre-finetuning is fixed at 1,600 and the number of prompts is varied over 1, 2, 4, 8, 16, 32, and 64. For finetuning, 50 training instances were used.
  • ...and 1 more figures
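
Figures 4 and 5 report QWK (quadratic weighted kappa), an agreement measure between predicted and gold scores. The sketch below shows the standard definition in plain Python, assuming integer scores on a known range. The function name and arguments are illustrative and are not taken from the paper's code.

```python
from collections import Counter


def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """QWK between two equal-length lists of integer scores.

    kappa = 1 - sum(w_ij * O_ij) / sum(w_ij * E_ij), where
    w_ij = (i - j)^2 / (N - 1)^2, O is the observed count matrix and
    E is the count matrix expected from the two marginal histograms.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    n_classes = max_score - min_score + 1
    n = len(gold)

    # Observed joint counts and per-rater score histograms.
    observed = Counter((g - min_score, p - min_score) for g, p in zip(gold, pred))
    gold_hist = Counter(g - min_score for g in gold)
    pred_hist = Counter(p - min_score for p in pred)

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = gold_hist[i] * pred_hist[j] / n
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0.0:
        # Only happens when both lists contain one identical constant score.
        return 1.0
    return 1.0 - numerator / denominator


if __name__ == "__main__":
    print(quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 0, 3))
```

Because the penalty grows with the squared distance between scores, a prediction that is off by two points costs four times as much as one that is off by one, which suits ordinal scoring scales.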