Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks

Esaú Villatoro-Tello, Srikanth Madikeri, Juan Zuluaga-Gomez, Bidisha Sharma, Seyyed Saeed Sarfjoo, Iuliia Nigmatulina, Petr Motlicek, Alexei V. Ivanov, Aravind Ganapathiraju

TL;DR

This work systematically compares text-based, lattice-based, and multimodal representations for spoken language understanding under realistic ASR conditions using the SLURP dataset. It analyzes a conventional NLU/SLU pipeline (HERMIT), the lattice-based WCN-BERT, and MulT-based multimodal approaches, along with XLSR-53 ASR, evaluating with both manual and 1-best transcripts and examining data quality with a cleaned version of the corpus, SLURP_F. The study finds that word-consensus networks provide a modest gain over 1-best (5.5% relative improvement), while cross-modal approaches reach near-oracle performance (17.8% relative improvement over 1-best). Multimodal SLU delivers the strongest results under automatic transcripts, albeit at higher computational cost, and access to manual transcripts or a domain-adapted ASR system further boosts performance. The paper also highlights inconsistencies in SLURP, demonstrates the impact of dataset cleaning, and releases a cleaned SLURP version to support reproducibility. Overall, multimodal SLU is recommended for best accuracy in realistic settings, with traditional pipelines preferred when manual transcripts are available or ASR adaptation is feasible.

Abstract

In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems on the SLU intent detection task: 1) text-based, 2) lattice-based, and 3) a novel multimodal approach. Our work provides a comprehensive analysis of the performance achievable by different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) output, namely word-consensus networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.

Paper Structure (11 sections, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Overview of the considered NLU/SLU methodologies for our performed experiments.