SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions

Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Erik Marchi

TL;DR

SELMA is a unified, end-to-end Speech-Enabled Language Model that processes audio and text to jointly perform Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR), along with two auxiliary tasks, using parameter-efficient LoRA-based fine-tuning and a feature pooling strategy. The model extends Qwen-Audio-Chat with a four-component architecture (audio encoder, aggregator, gating, and decoder-only LLM) and uses prompts to orchestrate multiple tasks within a single forward pass. Empirical results show that SELMA improves over task-specific baselines, achieving a relative Equal-Error Rate (EER) reduction of 64% on VT detection and 22% on DDSD while keeping the word error rate (WER) close to the baseline; Detection Error Tradeoff (DET) curves show that the gains hold across decision thresholds. The approach reduces input-processing complexity in virtual assistants (VAs) and generalizes across tasks, indicating practical benefits for real-world VA systems. Future work will expand to end-to-end integration with downstream natural language understanding (NLU) components.

Abstract

In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for virtual Assistant interactions that integrates audio and text as inputs to a Large Language Model (LLM). SELMA is designed to handle three primary and two auxiliary tasks related to interactions with virtual assistants simultaneously within a single end-to-end model. We employ low-rank adaptation modules for parameter-efficient training of both the audio encoder and the LLM. Additionally, we implement a feature pooling strategy that enables the system to recognize global patterns and improves accuracy on tasks less reliant on individual sequence elements. Experimental results on Voice Trigger (VT) detection, Device-Directed Speech Detection (DDSD), and Automatic Speech Recognition (ASR) demonstrate that our approach significantly simplifies the typical input processing pipeline of virtual assistants and improves performance compared to dedicated models for each individual task. SELMA yields relative Equal-Error Rate improvements of 64% on the VT detection task and 22% on DDSD, while also achieving word error rates close to the baseline.
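
To make the idea of low-rank adaptation concrete, here is a minimal sketch of the standard LoRA update, in which a frozen weight matrix W is augmented by a trainable low-rank product scaled by alpha / r. It is an illustration only, not the authors' implementation: the function names, the toy matrices, and the rank and layer sizes in the usage note are assumptions chosen for this example and do not come from the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r)
    would be trained. The rank r is the number of rows of A.
    """
    r = len(A)
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank trainable path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Number of trainable values in the LoRA factors A and B."""
    return r * (d_in + d_out)


# Toy example with hypothetical values: identity W, rank-1 adapter.
W = [[1, 0], [0, 1]]
A = [[1, 1]]
B = [[2], [0]]
y = lora_forward(W, A, B, x=[3, 4], alpha=1)
```

For a hypothetical 4096 x 4096 layer with rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values, compared with 16,777,216 for updating the full matrix, which is the sense in which this kind of training is parameter-efficient.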

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A typical pipeline for user input processing in virtual assistants (a) and our proposed approach (b).
  • Figure 2: Overview of our approach.
  • Figure 3: DET curves for VT detection and DDSD experiments from Table \ref{tab:exp}. The False Accept Rate (FAR) is the rate at which unintended queries (DDSD) or queries without the trigger phrase (VT) were falsely classified as intended or as containing the trigger phrase. The False Reject Rate (FRR) is the rate at which intended queries (DDSD) or queries containing the trigger phrase (VT) were falsely classified as unintended or as not containing the trigger phrase. The markers on each curve indicate the EER. Dotted lines represent baselines and solid lines represent our approach.
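
The FAR, FRR, and EER quantities in the Figure 3 caption can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the evaluation code used in the paper: it assumes higher scores mean "accept", approximates the EER by scanning each observed score as a threshold, and uses invented toy scores and invented baseline numbers. The function names are hypothetical.

```python
def far_frr(pos_scores, neg_scores, threshold):
    """Return (FAR, FRR) when every score >= threshold is accepted.

    pos_scores: scores of queries that should be accepted
    neg_scores: scores of queries that should be rejected
    """
    false_accepts = sum(1 for s in neg_scores if s >= threshold)
    false_rejects = sum(1 for s in pos_scores if s < threshold)
    return false_accepts / len(neg_scores), false_rejects / len(pos_scores)


def equal_error_rate(pos_scores, neg_scores):
    """Approximate the EER: the operating point where FAR and FRR are closest.

    Returns (eer, threshold), with eer taken as the mean of FAR and FRR
    at the threshold that minimizes |FAR - FRR|.
    """
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        far, frr = far_frr(pos_scores, neg_scores, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_reduction(baseline, new):
    """Relative improvement of `new` over `baseline`, e.g. 0.64 for 64%."""
    return (baseline - new) / baseline


# Toy scores, invented for illustration only.
eer, threshold = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
```

On these toy scores, FAR and FRR are both 0.25 at a threshold of 0.6, so the EER is 25%. A relative reduction is computed against the baseline value: with hypothetical EERs of 0.10 for a baseline and 0.036 for a new model, `relative_reduction(0.10, 0.036)` is about 0.64, that is, a 64% relative improvement.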