Assessment of L2 Oral Proficiency using Speech Large Language Models

Rao Ma, Mengjie Qian, Siyuan Tang, Stefano Bannò, Kate M. Knill, Mark J. F. Gales

TL;DR

This paper investigates speech large language models (LLMs) for holistic L2 English speaking proficiency assessment, addressing the information loss of cascaded ASR-plus-model pipelines and the limitations of end-to-end baselines. It compares three training targets, Cross Entropy (CE), Fair Average Loss (FA), and regression (Reg), and two decoding modes (hard vs soft) on the Linguaskill and Speak & Improve datasets, using Qwen2Audio with LoRA adaptation. The FA loss with soft decoding and LoRA achieves the strongest results, reaching Pearson correlation coefficients (PCCs) of up to $0.954$ on LinGen and $0.938$ on LinBus, and generalises robustly across test parts and even across tasks. The approach predicts scores directly from audio without ASR decoding and shows zero-shot capabilities, which has practical implications for scalable, consistent, and efficient spoken language assessment (SLA) systems.

Abstract

The growing population of L2 English speakers has increased the demand for developing automatic graders for spoken language assessment (SLA). Historically, statistical models, text encoders, and self-supervised speech models have been utilised for this task. However, cascaded systems suffer from the loss of information, while E2E graders also have limitations. With the recent advancements of multi-modal large language models (LLMs), we aim to explore their potential as L2 oral proficiency graders and overcome these issues. In this work, we compare various training strategies using regression and classification targets. Our results show that speech LLMs outperform all previous competitive baselines, achieving superior performance on two datasets. Furthermore, the trained grader demonstrates strong generalisation capabilities in the cross-part or cross-task evaluation, facilitated by the audio understanding knowledge acquired during LLM pre-training.
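The hard and soft decoding modes and the PCC metric named in the TL;DR can be illustrated with a minimal sketch. It assumes, without confirmation from this summary, that hard decoding takes the score of the most probable class and soft decoding takes the probability-weighted average over classes. The function names, score labels, and probabilities are hypothetical and are not taken from the paper.

```python
import math

def hard_decode(probs, scores):
    """Assumed 'hard' mode: return the score of the most probable class."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return scores[best]

def soft_decode(probs, scores):
    """Assumed 'soft' mode: return the probability-weighted average score."""
    total = sum(probs)  # renormalise in case probs do not sum to exactly 1
    return sum(p * s for p, s in zip(probs, scores)) / total

def pearson(xs, ys):
    """Pearson correlation coefficient (PCC) between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical class scores and model probabilities for one response.
scores = [2.0, 3.0, 4.0]
probs = [0.1, 0.6, 0.3]
hard = hard_decode(probs, scores)
soft = soft_decode(probs, scores)
```

Under these assumptions, soft decoding can produce scores between the discrete classes, which is one plausible reason it would correlate better with continuous reference scores than hard decoding does.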

Paper Structure

This paper contains 14 sections, 5 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: References vs predictions using Qwen2Audio graders.