Performance evaluation of SLAM-ASR: The Good, the Bad, the Ugly, and the Way Forward

Shashi Kumar, Iuliia Thorbecke, Sergio Burdisso, Esaú Villatoro-Tello, Manjunath K E, Kadri Hacioğlu, Pradeep Rangappa, Petr Motlicek, Aravind Ganapathiraju, Andreas Stolcke

TL;DR

The paper evaluates SLAM-ASR, a pipeline in which a fixed speech encoder and an LLM are connected by a trainable projector, under cross-domain and perturbation scenarios. It uses LibriSpeech, CallHome, CommonVoice, and a private ContactCenter dataset, with projector training mirroring the original SLAM-ASR setup. The key findings are that performance degrades strongly across domains and is sensitive to tempo changes and noise, that the alignment between speech and text tokens is fragile when the LLM is frozen, and that LoRA adapters can substantially improve both alignment and performance. These results give practical guidance for deploying robust LLM-based ASR: they highlight the value of in-domain training and adapter-based fine-tuning, and they indicate that traditional ASR may outperform SLAM-ASR in very noisy conditions.

Abstract

Recent research has demonstrated that training a linear connector between speech foundation encoders and large language models (LLMs) enables this architecture to achieve strong ASR capabilities. Despite the impressive results, it remains unclear whether these simple approaches are robust enough across different scenarios and speech conditions, such as domain shifts and speech perturbations. In this paper, we address these questions by conducting various ablation experiments using a recent and widely adopted approach called SLAM-ASR. We present novel empirical findings that offer insights on how to effectively utilize the SLAM-ASR architecture across a wide range of settings. Our main findings indicate that SLAM-ASR exhibits poor performance in cross-domain evaluation settings. Additionally, speech perturbations on in-domain data, such as changes in speech rate or additive noise, can significantly degrade performance. Our findings offer critical insights for fine-tuning and configuring robust LLM-based ASR models, tailored to different data characteristics and computational resources.
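To make the "additive noise" perturbation concrete, here is a minimal sketch of mixing a noise signal into speech at a chosen signal-to-noise ratio (SNR). It is an illustration only: the abstract does not say how the authors generate noisy audio, and the function name `mix_at_snr`, the synthetic sine-wave "speech", and the Gaussian noise are all assumptions introduced here.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not taken from the paper.
    Both inputs are equal-length lists of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal has zero power")
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the target noise power.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(speech, noise)]


# Toy signals (assumptions): a 440 Hz tone at 16 kHz stands in for speech.
rng = random.Random(0)
speech = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values give noisier mixtures, so sweeping this value is one way to probe how WER changes with noise level.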

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: SLAM-ASR pipeline. The selected models and the number of parameters for the performed experiments appear between brackets.
  • Figure 2: Analysis of the impact of tempo (speech rate) and noise on WER.
  • Figure 3: Scatter plots of WER versus speech duration for SLAM-ASR (bottom) and ASR baseline (top) on the LibriSpeech test-clean set: unchanged (left) and half-speed (right).
  • Figure 4: The pairwise cosine similarity between every pair of speech and text token embeddings for two test examples before (left side) and after using LoRA (right side) in the LLM. Colors range from purple (-1, low similarity) through green (0, neutral) to yellow (+1, high similarity), using the viridis colormap.
  • Figure 5: Ground truth transcript compared to learned speech tokens for SLAM-ASR and SLAM-ASR+LoRA for the same input audio.
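The pairwise cosine similarity described in the Figure 4 caption can be sketched as follows. This is a minimal illustration with made-up two-dimensional embeddings, not the authors' code: the function name `cosine_similarity_matrix` and the toy vectors are assumptions, and real speech and text token embeddings would have the LLM's hidden dimension.

```python
import math


def cosine_similarity_matrix(speech_embs, text_embs):
    """Return a matrix S where S[i][j] is the cosine similarity between
    speech embedding i and text embedding j (values lie in [-1, 1]).

    Hypothetical helper for illustration; inputs are lists of float vectors.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    matrix = []
    for s in speech_embs:
        s_norm = norm(s)
        row = []
        for t in text_embs:
            t_norm = norm(t)
            if s_norm == 0.0 or t_norm == 0.0:
                # Cosine similarity is undefined for zero vectors; report 0.
                row.append(0.0)
            else:
                dot = sum(a * b for a, b in zip(s, t))
                row.append(dot / (s_norm * t_norm))
        matrix.append(row)
    return matrix


# Toy embeddings (assumptions): three "speech tokens" and two "text tokens".
speech_tokens = [[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
text_tokens = [[2.0, 0.0], [0.0, 1.0]]
sim = cosine_similarity_matrix(speech_tokens, text_tokens)
```

Each row of `sim` corresponds to one speech token and each column to one text token, which is the layout a heatmap such as the one in Figure 4 would visualize.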