Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Ognjen Rudovic, Pranay Dighe, Yi Su, Vineet Garg, Sameer Dharur, Xiaochuan Niu, Ahmed H. Abdelaziz, Saurabh Adya, Ahmed Tewfik

TL;DR

This work explores using Large Language Models (LLMs) to model the first query when making inferences about follow-ups, either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM.

Abstract

Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection (DDSD) from the follow-up queries is critical for enabling a naturalistic user experience. To this end, we explore the use of Large Language Models (LLMs) and model the first query when making inferences about the follow-ups (based on the ASR-decoded text), either by prompting a pretrained LLM or by adapting a binary classifier on top of the LLM. In doing so, we also exploit the ASR uncertainty when designing the LLM prompts. We show on a real-world dataset of follow-up conversations that this approach yields large gains (20-40% reduction in false alarms at 10% fixed false rejects) due to the joint modeling of the previous speech context and ASR uncertainty, compared to when follow-ups are modeled alone.
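The headline metric above (false alarms at a fixed 10% false-reject rate) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not the authors' evaluation code. It assumes that a higher score means "more likely device-directed", that label 1 marks device-directed follow-ups, and that the reported reduction is relative to a baseline. The function names `fa_rate_at_fixed_fr` and `relative_reduction` are hypothetical.

```python
import math
from typing import Sequence


def fa_rate_at_fixed_fr(scores: Sequence[float],
                        labels: Sequence[int],
                        target_fr: float = 0.10) -> float:
    """False-alarm rate at the threshold whose false-reject rate is <= target_fr.

    Assumptions (not from the paper): higher score = more likely
    device-directed; label 1 = device-directed, label 0 = not device-directed.
    """
    positives = sorted(s for s, y in zip(scores, labels) if y == 1)
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")

    # A query is accepted when score >= threshold. Placing the threshold at
    # the k-th lowest positive score rejects at most k positives, so the
    # false-reject rate is at most k / len(positives) <= target_fr.
    k = min(math.floor(target_fr * len(positives) + 1e-9), len(positives) - 1)
    threshold = positives[k]

    # False alarms are non-directed queries that are still accepted.
    false_alarms = sum(1 for s in negatives if s >= threshold)
    return false_alarms / len(negatives)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a rate, e.g. 0.5 -> 0.3 is a 40% reduction."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - improved) / baseline
```

Computing `fa_rate_at_fixed_fr` once for a model that sees only the follow-up and once for a model that also sees the first query, then passing both rates to `relative_reduction`, gives a comparison in the same form as the one quoted in the abstract.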

Paper Structure

This paper contains 11 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Follow-up conversations: the pair of user queries is first processed by an ASR system, which outputs the text transcriptions of the user's speech. The joint ASR transcription of the initial and follow-up queries is input to the LLM, which detects whether the latter is directed to the Virtual Assistant (VA).
  • Figure 2: Task prompt used for Device-directed Speech Detection. Note that for the follow-up query, the text in italics is added when including the n-best hypotheses from the ASR system.
  • Figure 3: Detection error trade-off (DET) curves showing the accuracy of the models across a range of operating regimes. In brackets next to the model names, we show the type of ASR hypothesis used by each reported model ($1$-best and $n$-best, with $n=8$ found on the validation set to work best for the task). The lines in black and red represent the models without and with context of the previous query, respectively. The green dot shows the accuracy of the best prompting-based approach ("FinetunePrompt" with LoRA adapters).
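As a rough illustration of the prompt design described for Figures 1 and 2, the sketch below assembles a text prompt from the ASR output of the first query and the n-best hypotheses of the follow-up. The exact prompt wording used in the paper is not given in this section, so every string in the template, along with the function name `build_ddsd_prompt` and its parameters, is a hypothetical stand-in.

```python
from typing import Sequence


def build_ddsd_prompt(first_query: str,
                      followup_hypotheses: Sequence[str],
                      use_context: bool = True) -> str:
    """Assemble a DDSD prompt for a follow-up query (hypothetical wording).

    first_query: 1-best ASR text of the initial, keyword-invoked query.
    followup_hypotheses: n-best ASR hypotheses for the follow-up, best first.
    use_context: if False, the first query is omitted (follow-up-only baseline).
    """
    if not followup_hypotheses:
        raise ValueError("need at least one ASR hypothesis for the follow-up")

    lines = ["Decide whether the follow-up query is directed to the virtual assistant."]
    if use_context:
        lines.append(f"First query: {first_query}")

    if len(followup_hypotheses) == 1:
        # 1-best case: only the top ASR hypothesis is shown.
        lines.append(f"Follow-up query: {followup_hypotheses[0]}")
    else:
        # n-best case: listing alternatives exposes ASR uncertainty to the LLM.
        n = len(followup_hypotheses)
        lines.append(f"Follow-up query ({n}-best ASR hypotheses, most likely first):")
        for rank, hypothesis in enumerate(followup_hypotheses, start=1):
            lines.append(f"  {rank}. {hypothesis}")

    lines.append("Answer (yes or no):")
    return "\n".join(lines)
```

Calling the function with `use_context=False` yields the follow-up-only variant, which corresponds to the "without context" condition that Figure 3 contrasts with the context-aware models.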