Distilling an End-to-End Voice Assistant Without Instruction Training Data

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, Diyi Yang

TL;DR

This work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision, and shows that the Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation.

Abstract

Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models "forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using more than 100x less training compute.

Paper Structure (44 sections, 1 theorem, 2 equations, 7 figures, 3 tables)

Key Result

Lemma 1

Given the probability $P_t$ from a teacher model and the probability $P_s$ from a student model, the KL divergence is defined as $\mathrm{KL}(P_t, P_s) = P_t \cdot (\log P_t - \log P_s)$. For a transformer language model, $P_s = \sigma(O_s h_s)$, where $h_s$ is the final hidden state, $O_s$ is the outp
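
The definition in the lemma can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes $\sigma$ denotes the softmax function, uses small random values for $O_s$ and $h_s$, and (purely for illustration) builds the teacher distribution from a second, hypothetical hidden state `h_t` passed through the same matrix. The helper names `softmax`, `matvec`, and `kl_divergence` are illustrative and not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats (assumed meaning of sigma)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Matrix-vector product, used here for O_s h_s."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def kl_divergence(p_t, p_s):
    """KL(P_t, P_s) = P_t . (log P_t - log P_s), as written in the lemma."""
    return sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_t, p_s)
        if pt > 0.0  # terms with zero teacher probability contribute nothing
    )


random.seed(0)
vocab_size, hidden_size = 5, 3  # toy sizes chosen for illustration

# Hypothetical output matrix and hidden states; not values from the paper.
O_s = [[random.gauss(0, 1) for _ in range(hidden_size)] for _ in range(vocab_size)]
h_s = [random.gauss(0, 1) for _ in range(hidden_size)]
h_t = [random.gauss(0, 1) for _ in range(hidden_size)]

P_s = softmax(matvec(O_s, h_s))  # student distribution, P_s = sigma(O_s h_s)
P_t = softmax(matvec(O_s, h_t))  # teacher distribution (illustrative assumption)

kl_same = kl_divergence(P_t, P_t)  # identical distributions
kl_diff = kl_divergence(P_t, P_s)  # teacher vs. student

print(f"KL(P_t, P_t) = {kl_same:.6f}")
print(f"KL(P_t, P_s) = {kl_diff:.6f}")
```

The divergence is zero when the two distributions coincide and non-negative otherwise, which is why it is a natural objective for pulling a student's output distribution toward a teacher's.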

Figures (7)

  • Figure 1: Training pipeline for the Distilled Voice Assistant (DiVA). Red indicates trainable components, while blue indicates frozen pretrained modules. DiVA modifies a text-only LLM into a general-purpose Speech LLM by using the model's own responses to transcribed speech as self-supervision.
  • Figure 2: Results across our two Question Answering benchmarks covering both standard evaluation and robustness to regional accents. Model correctness is assessed using the PANDA metric, which is tuned for strong correlation with human judgments of correctness (panda), and significance is computed using a paired bootstrap test (hitchhikers).
  • Figure 3: Results across Emotion, Humor, and Sarcasm classification tasks. We measure class-weighted F1 for multi-class classification and accuracy for binary classification. Significance computed using a paired bootstrap test.
  • Figure 4: Results for Speech Translation across 7 typologically diverse languages. We evaluate using SacreBLEU and compute confidence intervals using a paired bootstrap.
  • Figure 5: Example of the double-blind interface for the user study, with responses (left: Qwen 2, right: DiVA) to the spoken prompt "Can you tell me about Large Language Models in the style of a haiku?"
  • ...and 2 more figures

Theorems & Definitions (2)

  • Lemma 1
  • proof