WHISMA: A Speech-LLM to Perform Zero-shot Spoken Language Understanding

Mohan Li, Cong-Thanh Do, Simon Keizer, Youmna Farag, Svetlana Stoyanchev, Rama Doddipatla

TL;DR

WHISMA is a speech-LLM with zero-shot capability that couples a frozen Whisper encoder to a frozen Llama-3 decoder through a trainable modality aligner and LoRA adapters, enabling end-to-end SLU across multiple tasks. It is trained on ~2000 hours of multi-task data including ASR, IC, SF, SQA, and SIT, plus Spoken-Alpaca to bolster generalisation. At inference, WHISMA can also run an auxiliary ASR step through SCoT or MR while keeping decoding end-to-end. The authors release Spoken-Alpaca and SLU-GLUE and evaluate WHISMA across STSC, STUC, and UTUC, reporting state-of-the-art zero-shot performance on SLURP SF and strong generalisation to unseen tasks, outperforming Qwen-Audio and modular baselines. The work shows that unified speech-LLMs are viable for broad SLU applications and provides resources that support reproducibility in the field.
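
The idea of keeping a pretrained weight unchanged while training only a small adapter can be illustrated with a generic LoRA layer. The sketch below is not the authors' implementation: the class name `LoRALinear`, the helper `matmul`, the matrix sizes, the rank, and the scaling factor `alpha` are all hypothetical choices made for illustration, and it uses plain Python lists rather than a deep-learning framework.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """A linear layer whose pretrained weight stays unchanged; only the
    low-rank factors lora_a and lora_b would be trained.
    Sizes, rank and alpha are illustrative assumptions."""

    def __init__(self, frozen_weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(frozen_weight), len(frozen_weight[0])
        self.weight = frozen_weight  # pretrained weight, never updated
        self.scale = alpha / rank
        # A starts with small random values and B with zeros, so the adapter
        # contributes nothing before any training step.
        self.lora_a = [[rng.gauss(0.0, 0.01) for _ in range(rank)] for _ in range(d_in)]
        self.lora_b = [[0.0] * d_out for _ in range(rank)]

    def forward(self, x):
        base = matmul(x, self.weight)
        update = matmul(matmul(x, self.lora_a), self.lora_b)
        return [[b + self.scale * u for b, u in zip(b_row, u_row)]
                for b_row, u_row in zip(base, update)]

    def trainable_parameter_count(self):
        d_in, rank, d_out = len(self.lora_a), len(self.lora_b), len(self.lora_b[0])
        return rank * (d_in + d_out)


# Hypothetical 4x3 pretrained weight and a single input row.
pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
layer = LoRALinear(pretrained, rank=2, alpha=4)
inputs = [[1.0, 2.0, 3.0, 4.0]]
outputs = layer.forward(inputs)
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and training only has to learn a low-rank correction. In this toy setting the adapter holds 14 trainable values against 12 in the full weight; the saving only appears at realistic layer sizes, where the rank is far smaller than either dimension.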

Abstract

Speech large language models (speech-LLMs) integrate speech and text-based foundation models to provide a unified framework for handling a wide range of downstream tasks. In this paper, we introduce WHISMA, a speech-LLM tailored for spoken language understanding (SLU) that demonstrates robust performance in various zero-shot settings. WHISMA combines the speech encoder from Whisper with the Llama-3 LLM, and is fine-tuned in a parameter-efficient manner on a comprehensive collection of SLU-related datasets. Our experiments show that WHISMA significantly improves the zero-shot slot filling performance on the SLURP benchmark, achieving a relative gain of 26.6% compared to the current state-of-the-art model. Furthermore, to evaluate WHISMA's generalisation capabilities to unseen domains, we develop a new task-agnostic benchmark named SLU-GLUE. The evaluation results indicate that WHISMA outperforms an existing speech-LLM (Qwen-Audio) with a relative gain of 33.0%.
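
The relative gains quoted above differ from absolute score differences. A minimal sketch of the distinction follows, assuming the conventional definition of relative gain as the improvement divided by the baseline score. The function name `relative_gain` and the example scores are hypothetical and are not taken from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (improved - baseline) / baseline


# Hypothetical scores chosen only to illustrate the arithmetic.
baseline_score = 40.0
improved_score = 50.0

absolute_gain = improved_score - baseline_score                 # 10.0 points
relative = relative_gain(baseline_score, improved_score)        # 0.25, i.e. 25%
print(f"absolute: {absolute_gain:.1f} points, relative: {relative:.1%}")
```

With these made-up numbers, a 10-point absolute improvement corresponds to a 25% relative gain, so a relative figure alone does not reveal the underlying scores.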

Paper Structure (11 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: An overview of the WHISMA model architecture.
  • Figure 2: E2E inference strategies integrating ASR into SLU.