UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions

Siddhant Arora, Hayato Futami, Jee-weon Jung, Yifan Peng, Roshan Sharma, Yosuke Kashiwagi, Emiru Tsunoo, Karen Livescu, Shinji Watanabe

TL;DR

UniverSLU advances universal spoken language understanding by fine-tuning a Whisper-based encoder–decoder on a broad set of SLU tasks under instruction-driven prompts. It uses both discrete task-specifier prompts and natural language instructions, including option lists, to guide a single model across 12 SLU task types spanning 17 datasets and 9 languages, matching or exceeding task-specific baselines and prior prompting approaches on most tasks. The model generalizes to unseen paraphrases of instructions for known tasks and shows promising zero-shot behavior on new datasets and languages, but it struggles with entirely unseen task types and with audio-domain tasks when pretraining data are limited. Overall, UniverSLU offers a data-efficient, scalable path toward a single multilingual SLU model controlled by instructions, with practical implications for cross-task, cross-language speech understanding and deployment efficiency.

Abstract

Recent studies leverage large language models with multi-tasking capabilities, using natural language prompts to guide the model's behavior and surpassing performance of task-specific models. Motivated by this, we ask: can we build a single model that jointly performs various spoken language understanding (SLU) tasks? We start by adapting a pre-trained automatic speech recognition model to additional tasks using single-token task specifiers. We enhance this approach through instruction tuning, i.e., finetuning by describing the task using natural language instructions followed by the list of label options. Our approach can generalize to new task descriptions for the seen tasks during inference, thereby enhancing its user-friendliness. We demonstrate the efficacy of our single multi-task learning model "UniverSLU" for 12 speech classification and sequence generation task types spanning 17 datasets and 9 languages. On most tasks, UniverSLU achieves competitive performance and often even surpasses task-specific models. Additionally, we assess the zero-shot capabilities, finding that the model generalizes to new datasets and languages for seen task types.

Paper Structure (21 sections, 8 equations, 1 figure, 11 tables)

Figures (1)

  • Figure 1: Schematic of our discrete prompt-based MTL formulation. The model is an encoder-decoder architecture initialized from OpenAI's pre-trained Whisper model, as detailed in Sec. \ref{sec: method}. The figure illustrates the sequence of tokens used as prompts and predicted by the decoder during inference. We explore the use of single-token task specifiers ($S^{\text{task\_type}}, S^{\text{lang}}, S^{\text{data}}$ in Eq. \ref{eq:task_specifier_formulation}) or natural language phrases ($I^{r}$ in Eq. \ref{eq:discrete_prompt_viterbi}) to describe the task, as shown in Figures \ref{fig:task specifiers} and \ref{fig:natural instructions}, respectively. Colored boxes denote a sequence of tokens, while white boxes denote the functionality enabled by that sequence. SOP, SOT, NT, and TRANS are standard Whisper tokens that specify start-of-prev, start-of-transcript, no-timestamps, and transcribe, respectively.
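
The two prompt styles named in the caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The four special-token strings are the standard Whisper spellings of SOP, SOT, NT, and TRANS. Everything else is an assumption made for illustration: the function names, the specifier spellings such as `<|ic|>` and `<|fsc|>`, the "Options:" separator, and the order in which tokens are placed. The instruction text is kept as a single string here, whereas a real tokenizer would split it into many tokens.

```python
from typing import List, Optional

# Standard Whisper special tokens named in the caption.
SOP = "<|startofprev|>"          # start-of-prev
SOT = "<|startoftranscript|>"    # start-of-transcript
NT = "<|notimestamps|>"          # no-timestamps
TRANS = "<|transcribe|>"         # transcribe


def task_specifier_prompt(lang: str, task_type: str, dataset: str) -> List[str]:
    """Prompt built from single-token specifiers (S^lang, S^task_type, S^data).

    The specifier spellings and their ordering are hypothetical.
    """
    return [SOT, f"<|{lang}|>", f"<|{task_type}|>", f"<|{dataset}|>", NT]


def instruction_prompt(
    instruction: str,
    options: Optional[List[str]] = None,
    lang: str = "en",
) -> List[str]:
    """Prompt built from a natural language instruction (I^r) and label options.

    Placing the instruction after SOP and the "Options:" format are assumptions.
    """
    text = instruction
    if options:
        text += " Options: " + ", ".join(options)
    return [SOP, text, SOT, f"<|{lang}|>", TRANS, NT]


if __name__ == "__main__":
    print(task_specifier_prompt("en", "ic", "fsc"))
    print(instruction_prompt("Classify the intent of the utterance.",
                             ["activate lights", "change language"]))
```

In both cases the decoder would be conditioned on the returned sequence and then generate the label or output sequence. The instruction style trades a fixed vocabulary of specifier tokens for free-form text, which is what allows paraphrased task descriptions at inference time.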