On the Evaluation of Speech Foundation Models for Spoken Language Understanding

Siddhant Arora; Ankita Pasad; Chung-Ming Chien; Jionghao Han; Roshan Sharma; Jee-weon Jung; Hira Dhamyal; William Chen; Suwon Shon; Hung-yi Lee; Karen Livescu; Shinji Watanabe

TL;DR

The paper introduces SLUE-PERB, an open benchmark to evaluate pre-trained speech foundation models (SFMs) on complex spoken language understanding (SLU) tasks over natural speech. By comparing self-supervised (SSL), supervised ASR, and supervised SLU SFMs under three integration strategies (frozen with a lightweight head, frozen with a complex head, and fine-tuned with a lightweight head), the study finds that SSL SFMs excel in sequence-generation tasks while supervised ASR SFMs shine in classification; a complex prediction head generally yields the best overall performance, albeit with higher latency. The findings emphasize task- and data-dependent model selection, show that performance gaps shrink with more powerful heads or fine-tuning, and describe practical tradeoffs between inference speed and training cost. The authors also release an open-source toolkit and leaderboard to standardize evaluation and spur further research in SLU representations.

Abstract

The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for open resources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFMs) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies.
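The three evaluation protocols listed in the abstract differ only in whether the SFM is frozen and in which kind of prediction head sits on top. A minimal sketch of that distinction follows; the names `Head`, `Protocol`, `PROTOCOLS`, and `trainable_parts` are hypothetical, chosen for illustration, and are not taken from the SLUE-PERB toolkit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Head(Enum):
    """Kind of prediction head placed on top of the SFM (illustrative labels)."""
    LIGHTWEIGHT = "lightweight"
    COMPLEX = "complex"


@dataclass(frozen=True)
class Protocol:
    """One evaluation protocol: an SFM training mode plus a head type."""
    name: str
    sfm_frozen: bool  # True: SFM weights are not updated during training
    head: Head


# The three protocols, in the order (i)-(iii) given in the abstract.
PROTOCOLS = [
    Protocol("frozen SFM + lightweight head", sfm_frozen=True, head=Head.LIGHTWEIGHT),
    Protocol("frozen SFM + complex head", sfm_frozen=True, head=Head.COMPLEX),
    Protocol("fine-tuned SFM + lightweight head", sfm_frozen=False, head=Head.LIGHTWEIGHT),
]


def trainable_parts(protocol: Protocol) -> List[str]:
    """Return which components receive gradient updates under a protocol."""
    parts = ["prediction head"]  # the head is trained in every protocol
    if not protocol.sfm_frozen:
        parts.insert(0, "SFM")
    return parts


for p in PROTOCOLS:
    print(f"{p.name}: trains {', '.join(trainable_parts(p))}")
```

Under this framing, protocols (i) and (ii) update only the head, while protocol (iii) also updates the SFM itself.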

Paper Structure (23 sections, 7 figures, 6 tables)

This paper contains 23 sections, 7 figures, 6 tables.

Introduction
Related Work
Pre-trained speech foundation models
Performance benchmarks
The SLUE-PERB benchmark
Tasks
Pre-trained speech foundation models
Evaluation Protocols
Experiments
Results
Lightweight prediction head
Do performance trends change with different modeling strategies?
Discussion
Is there an overall best model?
Performance-compute tradeoffs
...and 8 more sections

Figures (7)

Figure 1: Performance of various SSL SFMs with a lightweight prediction head on SLUE tasks.
Figure 2: Performance of various supervised ASR SFMs with a lightweight prediction head on SLUE tasks.
Figure 3: Performance of best performing SSL and ASR SFMs with a lightweight prediction head on SLUE tasks. The label for each bar is the specific SFM chosen.
Figure 4: ASR performance of SFMs with a lightweight prediction head on VoxCeleb and VoxPopuli datasets.
Figure 5: Performance of best performing SSL and ASR SFMs with complex prediction head on SLUE tasks. The label for each bar is the specific SFM chosen.
...and 2 more figures
