Generative Language Models with Retrieval Augmented Generation for Automated Short Answer Scoring
Zifan Wang, Christopher Ormerod
TL;DR
This paper introduces a retrieval-augmented generation (RAG) pipeline for Automated Short Answer Scoring (ASAS) that stores training responses in a vector database, retrieves semantically similar exemplars at inference time, and uses a generative language model (GLM) to assign scores. The information retrieval (IR) backbone is fine-tuned on task-specific data, and prompts are optimized using DSPy and Claude Prompt Generator; online scoring combines top-k retrieval with GLM autoscoring. On the SemEval-2013 datasets (SCIENTSBANK 3-way/2-way and Beetle 5-way), the approach achieves state-of-the-art results and demonstrates the value of context-rich prompts and retrieval for robust scoring, including under unseen conditions. Ablation studies isolate the contributions of input fields, IR configuration, loss functions, and RAG to performance, supporting the feasibility of GLM-based ASAS with retrieval for scalable, data-efficient evaluation. The work highlights practical implications for deploying GLM-based ASAS systems with careful IR customization and prompting strategies, acknowledges its limitations, and points to formative assessment applications, given appropriate safeguards, as a potential direction.
Abstract
Automated Short Answer Scoring (ASAS) is a critical component in educational assessment. While traditional ASAS systems relied on rule-based algorithms or complex deep learning methods, recent advancements in Generative Language Models (GLMs) offer new opportunities for improvement. This study explores the application of GLMs to ASAS, leveraging their off-the-shelf capabilities and their performance in various domains. We propose a novel pipeline that combines vector databases, transformer-based encoders, and GLMs to enhance short answer scoring accuracy. Our approach stores training responses in a vector database, retrieves semantically similar responses during inference, and employs a GLM to analyze these responses and determine appropriate scores. We further optimize the system through fine-tuned retrieval processes and prompt engineering. Evaluation on the SemEval-2013 dataset demonstrates a significant improvement on the SCIENTSBANK 3-way and 2-way tasks compared to existing methods, highlighting the potential of GLMs in advancing ASAS technology.
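
The store, retrieve, and score flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed` function stands in for a transformer-based encoder, the in-memory `ExemplarStore` stands in for a vector database, and `build_prompt` only assembles the text that would be sent to a GLM (no model is called). All names, the prompt layout, and the example responses are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a transformer encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ExemplarStore:
    """In-memory stand-in for a vector database of scored training responses."""

    def __init__(self):
        self.items = []  # list of (vector, response, label)

    def add(self, response, label):
        self.items.append((embed(response), response, label))

    def top_k(self, query, k=2):
        """Return the k stored (response, label) pairs most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [(resp, label) for _, resp, label in ranked[:k]]


def build_prompt(question, student_answer, exemplars):
    """Assemble a scoring prompt from retrieved exemplars (hypothetical layout)."""
    lines = [f"Question: {question}", "Scored examples:"]
    for resp, label in exemplars:
        lines.append(f"- Response: {resp} | Score: {label}")
    lines.append(f"Student response: {student_answer}")
    lines.append("Score:")
    return "\n".join(lines)


# Offline step: index scored training responses (made-up examples).
store = ExemplarStore()
store.add("the bulb lights because the circuit is closed", "correct")
store.add("the battery is empty", "incorrect")
store.add("the circuit is closed so current flows", "correct")

# Online step: retrieve similar exemplars and build the prompt for the GLM.
question = "Why does the bulb light?"
answer = "current flows because the circuit is closed"
exemplars = store.top_k(answer, k=2)
prompt = build_prompt(question, answer, exemplars)
print(prompt)
```

In a real system the resulting prompt would be passed to a GLM, and the encoder behind `embed` would be the component that is fine-tuned for retrieval.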
