A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Dominik Wagner; Alexander Churchill; Siddharth Sigtia; Panayiotis Georgiou; Matt Mirsamadi; Aarshee Mishra; Erik Marchi

TL;DR

This work explores whether it is feasible to drop the requirement that users begin each command with a trigger phrase, using the decoder outputs of an automatic speech recognition (ASR) system, alone or combined with acoustic features, as inputs to a large language model (LLM).

Abstract

Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We approach this task in three ways. First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate (EER) improvements over text-only and audio-only models of up to 39% and 61%, respectively. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
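The abstract reports its gains as relative EER improvements. As a minimal sketch of that arithmetic, the helper below computes a relative reduction from a baseline EER and a new EER. The function name and the numbers in the usage note are illustrative assumptions and are not values reported in the paper.

```python
def relative_reduction(baseline_eer: float, new_eer: float) -> float:
    """Return the relative EER reduction of new_eer with respect to baseline_eer.

    Both arguments must use the same unit (for example, percent).
    A result of 0.39 corresponds to a 39% relative improvement.
    """
    if baseline_eer <= 0:
        raise ValueError("baseline_eer must be positive")
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical numbers chosen only to illustrate the formula:
# a baseline EER of 10.0% lowered to 6.1% is roughly a 39% relative reduction.
example = relative_reduction(10.0, 6.1)
```

Note that a relative reduction says nothing about the absolute error level: under this definition, going from 10.0% to 6.1% and from 1.00% to 0.61% are both 39% relative improvements.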

Paper Structure (17 sections, 2 equations, 2 figures, 2 tables)

Introduction
Data
Training Data
Text-only Training Data
Evaluation Data
Feature Extraction
Text and ASR Features
Acoustic Features
Method
Large Language Models
Mapping Networks
Training Details
Unimodal Baselines
Experiments
Discussion
...and 2 more sections

Figures (2)

Figure 1: Architecture of the multimodal system. The weights of the grey-shaded components ($M_1$, $M_2$ and LLM) are trained, all other components remain frozen. The unimodal baselines differ from the multimodal system as follows: In the text-only variant, the mapping networks $M_1$ and $M_2$ are removed and the only input features are the 1-best hypotheses of the ASR system. In the audio-only variant, the decoder signals including $M_2$ and the 1-best hypotheses are removed. The DS-only system relies only on the decoder signal input, which is transformed via $M_2$, i.e., $M_1$ and the 1-best hypotheses are removed from the overall system.
Figure 2: DET curves for a selection of experiments from Table \ref{tab:exp} and Table \ref{tab:abl}. The false accept rate (FAR) represents non-directed utterances that were falsely classified as directed utterances and the false reject rate (FRR) represents directed utterances that were falsely classified as non-directed utterances. Dotted lines show unimodal baselines (UM2 and UM4) and solid lines show multimodal experiments (MM6 and MM6.3). The points on each curve indicate the EER of the respective experiment.
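The FAR, FRR, and EER definitions in the Figure 2 caption can be sketched in a few lines of code. This is an illustration under stated assumptions, not the authors' evaluation code: it assumes each utterance has a scalar score where higher means "more likely device-directed", that labels are 1 for directed and 0 for non-directed, and that the EER is approximated at the threshold where FAR and FRR are closest. The function names are hypothetical.

```python
def far_frr(scores, labels, threshold):
    """Compute (FAR, FRR) when utterances with score >= threshold are accepted.

    labels: 1 = device-directed, 0 = non-directed.
    FAR: fraction of non-directed utterances accepted as directed.
    FRR: fraction of directed utterances rejected as non-directed.
    """
    non_directed = [s for s, y in zip(scores, labels) if y == 0]
    directed = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in non_directed) / len(non_directed)
    frr = sum(s < threshold for s in directed) / len(directed)
    return far, frr


def equal_error_rate(scores, labels):
    """Approximate the EER by scanning every observed score as a threshold.

    Returns (eer, threshold), where eer is the mean of FAR and FRR at the
    threshold that minimizes |FAR - FRR|. Ties keep the lowest threshold.
    """
    best = None
    for t in sorted(set(scores)) + [float("inf")]:
        far, frr = far_frr(scores, labels, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores invented for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 1, 0, 0, 0]
toy_eer, toy_threshold = equal_error_rate(toy_scores, toy_labels)
```

Sweeping the threshold and plotting the resulting (FAR, FRR) pairs traces out one DET curve. The marked point on each curve in the figure corresponds to the operating point where the two error rates coincide.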
