One Whisper to Grade Them All

Nhan Phan; Anusha Porwal; Yaroslav Getman; Ekaterina Voskoboinik; Tamás Grósz; Mikko Kurimo


TL;DR

This work addresses scalable, holistic Automatic Speaking Assessment (ASA) for multi-part second-language tests. A single Whisper-small encoder processes all four spoken responses directly from audio, with no transcription step. A lightweight aggregator combines the encoder outputs of the 30-second audio chunks and predicts a CEFR score, reaching an RMSE of $0.384$ on the Speak & Improve SLA data against $0.44$ for the baseline. A swap sampling strategy adds strong data efficiency, training on only about $44.8\%$ of speaker IDs. Key contributions are a comparison of two aggregators (AVG and TF), a swap-based data augmentation method, and an analysis of reliability and validity, which shows the model is robust. Inference is fast (about 60 s on CPU for a four-part test, under 1 s on GPU), and the models have around 154–168M parameters. The results demonstrate practical potential for large-scale Computer-Assisted Language Learning (CALL) systems, but they also highlight that acoustic-only ASA has limited sensitivity to response content. Future work should integrate content-aware components to improve validity and provide actionable feedback.
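The chunk-and-aggregate idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the AVG aggregator behaves like plain mean pooling over chunk embeddings followed by a linear scoring head, and `encode_chunk`, `weights`, and `bias` are hypothetical placeholders for the shared encoder and the learned head. The 16 kHz sample rate is the input rate Whisper models expect.

```python
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000   # Whisper models expect 16 kHz audio
CHUNK_SECONDS = 30     # chunk length mentioned in the summary
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS


def split_into_chunks(samples: Sequence[float]) -> List[List[float]]:
    """Cut one response into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = list(samples[start:start + CHUNK_SAMPLES])
        chunk.extend([0.0] * (CHUNK_SAMPLES - len(chunk)))
        chunks.append(chunk)
    return chunks


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one embedding to pool")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def predict_score(
    responses: Sequence[Sequence[float]],
    encode_chunk: Callable[[List[float]], List[float]],  # hypothetical encoder stand-in
    weights: Sequence[float],                            # hypothetical linear head
    bias: float,
) -> float:
    """Encode every chunk of every response with one shared encoder,
    average the embeddings (assumed AVG behaviour), and apply a linear head."""
    embeddings = [
        encode_chunk(chunk)
        for response in responses
        for chunk in split_into_chunks(response)
    ]
    pooled = mean_pool(embeddings)
    return sum(w * x for w, x in zip(weights, pooled)) + bias
```

The point of the design is that the same encoder is reused for all responses, so only the small aggregator and scoring head are specific to the grading task.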

Abstract

We present an efficient end-to-end approach for holistic Automatic Speaking Assessment (ASA) of multi-part second-language tests, developed for the 2025 Speak & Improve Challenge. Our system's main novelty is the ability to process all four spoken responses with a single Whisper-small encoder, combine all information via a lightweight aggregator, and predict the final score. This architecture removes the need for transcription and per-part models, cuts inference time, and makes ASA practical for large-scale Computer-Assisted Language Learning systems. Our system achieved a Root Mean Squared Error (RMSE) of 0.384, outperforming the text-based baseline (0.44) while using at most 168M parameters (about 70% of Whisper-small). Furthermore, we propose a data sampling strategy, allowing the model to train on only 44.8% of the speakers in the corpus and still reach 0.383 RMSE, demonstrating improved performance on imbalanced classes and strong data efficiency.
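For readers unfamiliar with the reported metric, the following minimal sketch shows how a Root Mean Squared Error is computed from predicted and reference scores. The function name and the toy scores in the usage note are illustrative assumptions, not data from the challenge.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root Mean Squared Error between predicted and reference scores."""
    if len(predicted) != len(reference) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

With made-up scores, `rmse([3.0, 4.5], [3.5, 4.0])` averages two squared errors of 0.25 and takes the square root, giving 0.5. Lower values mean predictions sit closer to the human-assigned scores.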
