Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning

Keqi Deng, Guangzhi Sun, Philip C. Woodland

TL;DR

Wav2Prompt enables end-to-end speech prompts for fixed text-based LLMs. It is trained solely on ASR data to produce a label-level speech representation aligned with LLM token embeddings via a continuous integrate-and-fire (CIF) mechanism and a dedicated loss. By treating LLM embeddings as training targets and using CIF for explicit speech-text alignment, it preserves zero-shot capabilities while enabling few-shot E2E fine-tuning without updating the LLM. Across speech translation (ST), SLU, SQA, and SQQA, zero-shot Wav2Prompt performs on par with ASR-LLM cascades and substantially outperforms Encoder-LLM; after few-shot fine-tuning it improves markedly over the cascade (e.g., 8.5 BLEU on En–Fr ST with BLOOMZ-7B1). The training loss $L_{\rm Train}$ includes a cross-entropy term $L_{\rm CE}$ alongside additional terms (see training losses). The approach offers a practical, resource-efficient path to extending LLMs to spoken-language tasks, with potential for broad multilingual application and easy adaptation via prompt templates and limited paired data.

Abstract

Wav2Prompt is proposed, which allows straightforward integration between spoken input and a text-based large language model (LLM). Wav2Prompt uses a simple training process that requires only the same data used to train an automatic speech recognition (ASR) model. After training, Wav2Prompt learns continuous representations from speech and uses them as LLM prompts. To avoid the task over-fitting issues found in prior work and to preserve the emergent abilities of LLMs, Wav2Prompt takes LLM token embeddings as the training targets and utilises a continuous integrate-and-fire mechanism for explicit speech-text alignment. Therefore, a Wav2Prompt-LLM combination can be applied to zero-shot spoken language tasks such as speech translation (ST), speech understanding (SLU), speech question answering (SQA) and spoken-query-based QA (SQQA). It is shown that for these zero-shot tasks, Wav2Prompt performs similarly to an ASR-LLM cascade and better than recent prior work. If relatively small amounts of task-specific paired data are available in few-shot scenarios, the Wav2Prompt-LLM combination can be end-to-end (E2E) fine-tuned. The Wav2Prompt-LLM combination then yields greatly improved results relative to an ASR-LLM cascade for the above tasks. For instance, for English-French ST with the BLOOMZ-7B1 LLM, a Wav2Prompt-LLM combination gave an 8.5 BLEU point increase over an ASR-LLM cascade.
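
To make the alignment step more concrete, the following is a minimal sketch of a generic continuous integrate-and-fire pass, not the authors' implementation. It assumes per-frame weights in [0, 1] and a firing threshold of 1.0; the function name `cif` and the toy inputs are hypothetical. Each fired vector corresponds to one label, which is what allows a one-to-one comparison with token embeddings.

```python
import numpy as np

def cif(encoder_out, alphas, threshold=1.0):
    """Generic CIF sketch (assumed form, not taken from the paper).

    encoder_out: (T, D) frame-level encoder outputs.
    alphas:      (T,) per-frame weights, assumed to lie in [0, 1].
    Returns an (N, D) array with one integrated vector per fired label.
    """
    T, D = encoder_out.shape
    fired = []
    acc_weight = 0.0          # weight accumulated since the last firing
    acc_state = np.zeros(D)   # weighted sum of frames since the last firing
    for t in range(T):
        a = float(alphas[t])
        if acc_weight + a < threshold:
            # Integrate: keep accumulating this frame into the current label.
            acc_weight += a
            acc_state = acc_state + a * encoder_out[t]
        else:
            # Fire: use just enough of this frame to reach the threshold.
            used = threshold - acc_weight
            fired.append(acc_state + used * encoder_out[t])
            # The leftover weight of this frame starts the next label.
            remainder = a - used
            acc_weight = remainder
            acc_state = remainder * encoder_out[t]
    return np.stack(fired) if fired else np.zeros((0, D))

# Toy example: four frames whose weights sum to 2.0, so two labels fire.
frames = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])
alphas = np.array([0.5, 0.5, 0.25, 0.75])
labels = cif(frames, alphas)
print(labels.shape)
```

Because the output length is driven by the accumulated weights rather than the number of frames, the speech representation ends up at the label level instead of the frame level.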

Paper Structure (35 sections, 7 equations, 1 figure, 8 tables)

Figures (1)

  • Figure 1: Illustration of the proposed Wav2Prompt architecture. $\bm{\oplus}$ denotes addition. Prefix and postfix text are task-specific prompt templates that can contain instructions. Their embeddings are obtained through the LLM embedding layer, as are the transcript token embeddings.
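
As a rough illustration of how the prompt described in the caption might be assembled, the sketch below looks up prefix and postfix template tokens in an embedding table and places the speech representation between them. The prefix-speech-postfix ordering is an assumption based on the names, and `embedding_table`, `build_prompt`, the token IDs and the dimensions are all hypothetical stand-ins rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4

# Stand-in for the LLM embedding layer (hypothetical random table).
embedding_table = rng.normal(size=(vocab_size, dim))

def build_prompt(prefix_ids, speech_repr, postfix_ids, table):
    """Concatenate template embeddings around the speech representation.

    prefix_ids / postfix_ids: token IDs of the task-specific template text.
    speech_repr: (N, D) label-level speech representation.
    table: (V, D) embedding matrix used to embed the template tokens.
    """
    prefix = table[np.asarray(prefix_ids, dtype=int)]    # (P, D)
    postfix = table[np.asarray(postfix_ids, dtype=int)]  # (Q, D)
    return np.concatenate([prefix, speech_repr, postfix], axis=0)

# Three label-level speech vectors wrapped by a 2-token prefix and 1-token postfix.
speech_repr = rng.normal(size=(3, dim))
prompt = build_prompt([1, 2], speech_repr, [7], embedding_table)
print(prompt.shape)
```

Under this sketch, switching tasks only means changing the template token IDs, while the speech representation is left untouched.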