MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark

Yuezhang Peng, Chonghao Cai, Ziang Liu, Shuai Fan, Sheng Jiang, Hua Xu, Yuxin Liu, Qiguang Chen, Kele Xu, Yao Li, Sheng Wang, Libo Qin, Xie Chen

TL;DR

This work introduces MAC-SLU, a Chinese multi-intent SLU dataset tailored for automotive cabin scenarios to address data diversity and benchmark gaps. It pairs real-world text with TTS-generated Mandarin speech, spanning 8 domains, 81 intents, and 192 slots, including multi-intent queries. A unified benchmark evaluates open-source LLMs/LALMs across direct inference, in-context learning, and supervised fine-tuning, comparing pipeline and end-to-end SLU approaches; findings show SFT substantially outperforms in-context learning, and E2E LALMs can rival pipelines by avoiding ASR error propagation. The authors release code and data to enable fair comparisons and advocate future work on improving in-context capabilities and semantically aligned evaluation metrics.

Abstract

Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code (https://github.com/Gatsby-web/MAC_SLU) and datasets (huggingface.co/datasets/Gatsby1984/MAC_SLU) are released publicly.

Paper Structure

This paper contains 13 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: In-context learning prompt template for the joint SLU task. The intent lists, slot lists, and format are partially omitted for brevity.
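
The caption of Figure 1 says the in-context learning prompt combines an intent list, a slot list, and an output format. The figure itself is not reproduced here, so the following is only a rough sketch of how such a prompt might be assembled. The function name `build_icl_prompt`, the instruction wording, the label names, and the example utterances are all hypothetical and are not taken from the paper or the dataset.

```python
from __future__ import annotations

from typing import Sequence


def build_icl_prompt(
    utterance: str,
    intents: Sequence[str],
    slots: Sequence[str],
    examples: Sequence[tuple[str, str]] = (),
) -> str:
    """Assemble a plain-text prompt for joint intent and slot prediction.

    All wording below is illustrative; the paper's actual template may differ.
    """
    lines = [
        "You are a spoken language understanding module for an in-car assistant.",
        "Allowed intents: " + ", ".join(intents),
        "Allowed slots: " + ", ".join(slots),
        'Answer with a JSON list; each item has an "intent" and its "slots".',
    ]
    # Each demonstration is an (input text, expected output) pair.
    for text, answer in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {answer}")
    # The query goes last, with the output left open for the model to complete.
    lines.append(f"Input: {utterance}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical labels and utterances, chosen only to show a multi-intent query.
prompt = build_icl_prompt(
    utterance="open the sunroof and play some jazz",
    intents=["open_sunroof", "play_music"],
    slots=["music_genre"],
    examples=[
        (
            "turn on the seat heater",
            '[{"intent": "seat_heating_on", "slots": {}}]',
        )
    ],
)
print(prompt)
```

Passing an empty `examples` sequence gives a prompt with no demonstrations, which loosely corresponds to the direct-inference setting mentioned in the TL;DR; adding demonstrations gives the in-context learning setting.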