Scaling and Prompting for Improved End-to-End Spoken Grammatical Error Correction
Mengjie Qian, Rao Ma, Stefano Bannò, Kate M. Knill, Mark J. F. Gales
TL;DR
This work tackles end-to-end spoken grammatical error correction (SGEC) under data scarcity in two ways: (i) scaling the training data with a pseudo-labelling pipeline that converts unlabelled audio into SGEC training pairs, and (ii) prompting Whisper models with fluent transcriptions. Empirical results show that pseudo-labelling substantially narrows the gap between end-to-end SGEC and cascaded baselines for smaller models, while model size and prompting yield further gains. With large Whisper models and prompting, SGEC performance approaches that of the cascaded system, and feedback quality improves markedly. The study also analyzes the trade-offs of pseudo-labelled data for larger models and shows that prompting with fluent transcriptions is a robust way to improve both SGEC and feedback, even under data-constrained conditions. Taken together, the findings suggest that combining pseudo-labelling, scale, and prompting can make end-to-end SGEC competitive with traditional cascaded systems, with practical implications for language learning tools and automated feedback generation.
Abstract
Spoken Grammatical Error Correction (SGEC) and Feedback (SGECF) are crucial for second language learners, teachers and test takers. Traditional SGEC systems rely on a cascaded pipeline consisting of an automatic speech recognition (ASR) module, a disfluency detection (DD) and removal module, and a grammatical error correction (GEC) module. With the rise of end-to-end (E2E) speech foundation models, we investigate their effectiveness in SGEC and feedback generation. To address the challenge of limited labelled data, this work introduces a pseudo-labelling process that expands the training data from 77 hours to approximately 2500 hours, leading to improved performance. Additionally, we prompt an E2E Whisper-based SGEC model with fluent transcriptions, which yields a slight improvement in SGEC performance and more significant gains in feedback generation. Finally, we assess the impact of increasing model size and find that, while pseudo-labelled data does not yield a performance gain for a larger Whisper model, training with prompts proves beneficial.
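
To make the cascaded pipeline and the pseudo-labelling idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that pseudo-labels are produced by running a cascade (ASR, then disfluency removal, then GEC) over unlabelled audio; the abstract does not specify how the labels are generated. All names here (`PseudoLabelledPair`, `cascade`, `pseudo_label`, and the `toy_*` stages) are hypothetical, and the toy stages are trivial stand-ins for real models.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical stage signatures. Real systems would wrap an ASR model,
# a disfluency detector and a text GEC model behind these callables.
ASRFn = Callable[[str], str]   # audio identifier -> verbatim transcript
TextFn = Callable[[str], str]  # text -> text


@dataclass
class PseudoLabelledPair:
    audio_id: str
    fluent: str     # transcript after disfluency removal
    corrected: str  # grammatically corrected transcript (training target)


def cascade(audio_id: str, asr: ASRFn, dd: TextFn, gec: TextFn) -> PseudoLabelledPair:
    """Run the three cascaded stages in order: ASR -> DD/removal -> GEC."""
    verbatim = asr(audio_id)
    fluent = dd(verbatim)
    corrected = gec(fluent)
    return PseudoLabelledPair(audio_id, fluent, corrected)


def pseudo_label(audio_ids: Iterable[str], asr: ASRFn, dd: TextFn,
                 gec: TextFn) -> List[PseudoLabelledPair]:
    """Label every unlabelled utterance with the cascade's outputs."""
    return [cascade(a, asr, dd, gec) for a in audio_ids]


# --- Toy stand-ins, for illustration only ---
FILLERS = {"um", "uh", "er"}


def toy_asr(audio_id: str) -> str:
    # Pretend lookup instead of real speech recognition.
    return {"utt1": "um i i goes to school yesterday"}[audio_id]


def toy_dd(text: str) -> str:
    # Drop filler words and immediate word repetitions.
    out: List[str] = []
    for tok in text.split():
        if tok in FILLERS or (out and out[-1] == tok):
            continue
        out.append(tok)
    return " ".join(out)


def toy_gec(text: str) -> str:
    # Single hard-coded correction standing in for a GEC model.
    return text.replace("goes", "went")


pairs = pseudo_label(["utt1"], toy_asr, toy_dd, toy_gec)
print(pairs[0])
```

Under this assumption, each pair holds both a fluent transcript and a corrected transcript, so the same pass could in principle supply the fluent text used for prompting as well as the SGEC training target.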
