M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection

Anna Wang; Da Liu; Zhiyu Zhang; Shengqiang Liu; Jie Gao; Yali Li

TL;DR

M$^{3}$V tackles robust device-directed speech detection under ASR errors by framing the problem as a multi-view, multi-modal task. It combines unimodal text and audio encoders (GPT-2 and Wav2vec2) with a multi-modal fusion and a text–audio alignment view learned via a contrastive InfoNCE objective, producing four complementary views that feed a policy decision module. An adaptive loss balances four objectives, enabling joint optimization across modalities and views. Experiments on in-vehicle NOMI data show that M$^{3}$V outperforms unimodal and standard multi-modal baselines, achieving 96.27% accuracy with 4.94% EER on normal data and 95.71% accuracy on ASR-error data, surpassing human judgment in the ASR-error setting. The work demonstrates practical benefits for naturalistic VA interactions and robustness to ASR noise, with potential extensions to richer dialogue features.
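To make the contrastive alignment objective mentioned above more concrete, here is a minimal sketch of an InfoNCE loss over a batch of paired text and audio embeddings. It is not the authors' implementation: the function names (`cosine`, `info_nce`), the symmetric text-to-audio and audio-to-text averaging, the cosine similarity, and the temperature of 0.07 are all assumptions chosen for illustration, and the embeddings would in practice come from the text and audio encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, audio_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each list is assumed to be a matched pair.

    The temperature value and the symmetric form are illustrative assumptions,
    not settings taken from the paper.
    """
    n = len(text_embs)
    # logits[i][j] = scaled similarity between text i and audio j
    logits = [[cosine(t, a) / temperature for a in audio_embs] for t in text_embs]

    def mean_cross_entropy(rows):
        # Cross-entropy with the diagonal entry as the positive, using a
        # max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_denominator = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_denominator - row[i]
        return total / n

    text_to_audio = mean_cross_entropy(logits)
    audio_to_text = mean_cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (text_to_audio + audio_to_text)
```

Under these assumptions the loss is small when each text embedding is most similar to its own audio embedding and grows when the pairs are mismatched, which is the behavior an alignment view needs in order to flag text-audio pairs corrupted by ASR errors.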

Abstract

With the goal of more natural and human-like interaction with virtual voice assistants, recent research has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. This requires that, in scenes with complex sound sources, the voice assistant classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, which jointly models text and speech, has become the paradigm for device-directed speech detection. In practice, however, these models often produce incorrect predictions for unaligned input pairs because of the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection, which frames the problem as a multi-view learning task that introduces unimodal views and a text-audio alignment view into the network alongside the multi-modal view. Experimental results show that M$^{3}$V significantly outperforms models trained using only single or multi-modality and surpasses human judgment performance on ASR error data for the first time.

Paper Structure

This paper contains 11 sections, 6 equations, 1 figure, 2 tables, 2 algorithms.

Introduction
Method
Multi-Modal Learning
Multi-view Learning
Adaptive Learning
Policy Decision Module
Experiments
Datasets
Multi-Modal and Multi-View Experiments
Policy decision experiments
Conclusion

Figures (1)

Figure 1: Overview of the M$^{3}$V architecture. Given a text-audio pair as input, M$^{3}$V projects it onto four views: the unimodal views ($V_{a}$, $V_{t}$), the multi-modal view ($V_{m}$), and the aligned view ($V_{align}$). A policy decision module arbitrates among the probabilities predicted by these views.
