Speechworthy Instruction-tuned Language Models

Hyundong Cho, Nicolaas Jedema, Leonardo F. R. Ribeiro, Karishma Sharma, Pedro Szekely, Alessandro Moschitti, Ruben Janssen, Jonathan May

TL;DR

This work explores (i) prompting strategies based on radio-industry best practices and (ii) preference learning on a novel speech-based preference dataset of 20K samples, collected from annotators who listen to response pairs. Both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.

Abstract

Current instruction-tuned language models are exclusively trained with textual preference data and thus are often not aligned with the unique requirements of other modalities, such as speech. To better align language models with the speech domain, we explore (i) prompting strategies grounded in radio-industry best practices and (ii) preference learning using a novel speech-based preference dataset of 20K samples, generated with a wide spectrum of prompts that induce varying dimensions of speech-suitability and labeled by annotators who listen to response pairs. Both human and automatic evaluations show that prompting and preference learning each increase the speech-suitability of popular instruction-tuned LLMs. Interestingly, we find that prompting and preference learning can be additive; combining them achieves the best win rates in head-to-head comparisons, resulting in responses that are preferred over or tied with those of the base model in 76.2% of comparisons on average. Lastly, we share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.

Paper Structure (42 sections, 1 equation, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Current instruction-tuned language models tend to generate verbose responses with non-vocalizable content, such as bullet lists or parentheses, that are not suitable for responses that are delivered as speech by voice assistants (left, from OLMo 7B Instruct). Speech is serial and transient, and therefore concise yet informative responses with conversational follow-up questions are often preferred (right, from our adapted OLMo model).
  • Figure 2: Preference learning method overview. Since we only have an approximate idea of what makes a good spoken response, we first compile a set of system prompts intended to vary the speech-suitability of generated responses. We then sample a pair of these prompts, generate responses with various instruction-tuned language models (ITLMs) to further diversify the responses, convert them to speech with a TTS service, and have human annotators rank them after listening.
  • Figure 3: Head-to-head human evaluation results for OLMo (left) and Falcon (right). If the win rate is higher than the loss rate, the model mentioned second in the y-axis ($B$ for $A$ vs $B$) is more often preferred in a speech setting.
  • Figure 4: Head-to-head human evaluation results with our prompts using GPT-4. This figure takes the same format as Figure 3.
  • Figure 5: Falcon's DPO training trajectory suggests that prompts help the preference learning process by providing useful initial guidance for distinguishing between chosen and rejected responses.
  • ...and 4 more figures
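
The head-to-head numbers quoted in the abstract and in the Figure 3 caption (win rate, loss rate, and the share of responses "preferred or tied") can be illustrated with a small sketch. This is not the authors' evaluation code: the function name `head_to_head_rates`, the label strings, and the example labels are all hypothetical and chosen only to show the arithmetic, with each label recorded from the perspective of model $B$ in an "$A$ vs $B$" comparison.

```python
from collections import Counter

OUTCOMES = ("win", "tie", "loss")  # hypothetical label names, from model B's perspective


def head_to_head_rates(labels):
    """Return the fraction of wins, ties, and losses among pairwise judgments."""
    counts = Counter(labels)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons given")
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


# Made-up judgments for illustration only; these are not results from the paper.
labels = ["win"] * 5 + ["tie"] * 3 + ["loss"] * 2
rates = head_to_head_rates(labels)

# "Preferred or tied" combines wins and ties.
preferred_or_tied = rates["win"] + rates["tie"]

# Reading rule from the Figure 3 caption: B is more often preferred
# when its win rate exceeds its loss rate.
b_more_often_preferred = rates["win"] > rates["loss"]
```

Reporting ties separately, rather than folding them into wins, keeps the "win rate higher than loss rate" reading of Figure 3 distinct from the looser "preferred or tied" figure given in the abstract.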