Towards End-to-End Spoken Grammatical Error Correction

Stefano Bannò, Rao Ma, Mengjie Qian, Kate M. Knill, Mark J. F. Gales

TL;DR

This work investigates end-to-end spoken grammatical error correction (GEC), using the Whisper foundation model to replace all or part of the traditional cascaded pipeline of ASR, disfluency removal, and GEC. End-to-end disfluency detection benefits notably from Whisper and outperforms cascaded approaches on Switchboard. End-to-end GEC achieves performance comparable to cascaded systems on Linguaskill, although limited training data restricts its gains relative to systems trained on large quantities of text-based GEC data. The study also discusses evaluation challenges in spoken GEC, introduces both fine-tuning and soft prompt tuning strategies, and analyzes learner feedback through ERRANT edits to assess the practical usefulness of end-to-end feedback. Overall, the results demonstrate the feasibility of end-to-end spoken GEC and highlight the data and feedback-generation challenges that future work must address.

Abstract

Grammatical feedback is crucial for L2 learners, teachers, and testers. Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking. This process usually relies on a cascaded pipeline comprising an ASR system, disfluency removal, and GEC, with the associated concern of propagating errors between these individual modules. In this paper, we introduce an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper. This foundation model can be used to replace the whole framework or part of it, e.g., ASR and disfluency removal. These end-to-end approaches are compared to more standard cascaded approaches on the data obtained from a free-speaking spoken language assessment test, Linguaskill. Results demonstrate that end-to-end spoken GEC is possible within this architecture, but the lack of available data limits current performance compared to a system using large quantities of text-based GEC data. Conversely, end-to-end disfluency detection and removal, which is easier for the attention-based Whisper to learn, does outperform cascaded approaches. Additionally, the paper discusses the challenges of providing feedback to candidates when using end-to-end systems for spoken GEC.
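
The three system layouts described in the abstract (fully cascaded, Whisper replacing ASR and disfluency removal, and Whisper replacing the whole pipeline) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `toy_*` functions are hypothetical string-based stand-ins, not the paper's models, and the "audio" is represented as plain text so the example runs without any speech library.

```python
from typing import Callable, List

# Hypothetical stage type: each stage maps one representation to the next.
Stage = Callable[[str], str]

FILLERS = {"um", "uh", "er"}


def toy_asr(audio: str) -> str:
    """Stand-in for ASR: the 'audio' is already text, so just lower-case it."""
    return audio.lower()


def toy_disfluency_removal(transcript: str) -> str:
    """Drop filler words and immediate word repetitions."""
    kept: List[str] = []
    for word in transcript.split():
        if word in FILLERS:
            continue
        if kept and kept[-1] == word:
            continue
        kept.append(word)
    return " ".join(kept)


def toy_gec(fluent: str) -> str:
    """Stand-in for text GEC: one hard-coded correction rule."""
    return fluent.replace("he go ", "he goes ")


def chain(stages: List[Stage]) -> Stage:
    """Compose stages so that each one consumes the previous one's output."""
    def run(x: str) -> str:
        for stage in stages:
            x = stage(x)
        return x
    return run


# Fully cascaded: three separate modules, so errors can propagate between them.
cascaded = chain([toy_asr, toy_disfluency_removal, toy_gec])

# Partial end-to-end: one model maps audio to a fluent transcript, then text GEC.
toy_e2e_fluent_asr: Stage = chain([toy_asr, toy_disfluency_removal])
partial_e2e = chain([toy_e2e_fluent_asr, toy_gec])

# Full end-to-end: a single model maps audio to the corrected transcript.
# Here it is simulated by composition; a real system would be one network.
full_e2e: Stage = chain([toy_asr, toy_disfluency_removal, toy_gec])

print(cascaded("Um he he go to school"))
```

In this toy setting all three layouts return the same string; the sketch only shows where the module boundaries sit, which is where a cascaded system can pass errors from one stage to the next.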

Paper Structure (13 sections, 1 equation, 2 figures, 11 tables)

Figures (2)

  • Figure 1: Illustration of an E2E SGEC system and a cascaded system.
  • Figure 2: 10 most common ERRANT edit labels.