Do What I Say: A Spoken Prompt Dataset for Instruction-Following

Maike Züfle, Sara Papi, Fabian Retkowski, Szymon Mazurek, Marek Kasztelnik, Alexander Waibel, Luisa Bentivogli, Jan Niehues

TL;DR

DOWIS is a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Experiments with it show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings.

Abstract

Speech Large Language Models (SLLMs) have rapidly expanded, supporting a wide range of tasks. These models are typically evaluated using text prompts, which may not reflect real-world scenarios where users interact through speech. To address this gap, we introduce DoWhatISay (DOWIS), a multilingual dataset of human-recorded spoken and written prompts designed to pair with any existing benchmark for realistic evaluation of SLLMs under spoken instruction conditions. Spanning 9 tasks and 11 languages, it provides 10 prompt variants per task-language pair, across five styles. Using DOWIS, we benchmark state-of-the-art SLLMs, analyzing the interplay between prompt modality, style, language, and task type. Results show that text prompts consistently outperform spoken prompts, particularly in low-resource and cross-lingual settings. Spoken prompts close the gap only for tasks with speech output, highlighting the need for speech-based prompting in SLLM evaluation.

Paper Structure (20 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Performance comparison for Qwen: text prompts vs. speech prompts across target languages. Positive values (purple) indicate that text prompts perform better; negative values indicate that speech prompts perform better.
  • Figure 2: Performance comparison for Qwen2.5-Omni: text prompts vs. speech prompts across prompt types. Positive values (purple) indicate that text prompts perform better; negative values (yellow) indicate that speech prompts perform better.