Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

Ju Lin, Jing Pan, Ruizhi Li, Ming Sun, Yuzong Liu, Alaa Hassan, Jing Zheng, Florian Metze

TL;DR

The paper addresses the challenge of directional, multi-talker speech understanding for LLMs in wearable smart glasses by proposing two complementary pipelines: (1) a cascaded streaming source-separation (SS) front-end that produces speaker-tagged prompts for an LLM, enabling streaming ASR and translation, and (2) an end-to-end Serialized Output Training (SOT) approach that fine-tunes LLMs on multi-channel inputs with directional beamforming and LoRA. The work demonstrates that a streaming SS front-end combined with LLM prompting (SS+SLM) yields strong recognition and translation performance with accurate attribution of speech to the wearer, while an SOT-based SLM (SOT+SLM) offers robustness to overlapping speech. Through simulated multichannel experiments with multilingual data and Fleurs/MLS baselines, the authors report improved BLEU and SI-SDR scores and analyze trade-offs in speaker attribution and overlap handling. The practical significance lies in enabling real-time captions and translations in smart glasses, with potential impact on accessibility and multilingual communication in dynamic environments.

Abstract

Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.
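
As a rough illustration of the serialized output training idea mentioned above, the sketch below builds a single target string from several overlapping utterances by ordering them by start time and prefixing each with a speaker tag. The `Utterance` class, the `serialize_sot` helper, and the tag names `SELF` and `OTHER` are hypothetical choices made for this example; the paper's actual token inventory and ordering rule are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    speaker: str   # hypothetical directional label, e.g. "SELF" (wearer) or "OTHER"
    start: float   # utterance start time in seconds
    text: str      # reference transcript or translation


def serialize_sot(utterances: List[Utterance]) -> str:
    """Join multi-talker references into one serialized training target.

    Assumption: utterances are ordered by start time (first-in, first-out),
    and each one is prefixed with a tag naming its speaker.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return " ".join(f"<{u.speaker}> {u.text}" for u in ordered)


mixture = [
    Utterance("OTHER", 0.8, "where is the station"),
    Utterance("SELF", 0.0, "excuse me"),
    Utterance("OTHER", 3.1, "thank you"),
]
target = serialize_sot(mixture)
print(target)
```

A model fine-tuned on targets of this shape emits one token stream for the whole mixture, so speaker attribution is read off the tags rather than from separate output channels.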

Paper Structure

This paper contains 11 sections, 1 equation, 2 figures, and 2 tables.

Figures (2)

  • Figure 1: An illustration of a cascaded directional system comprising Source Separation and SLMs.
  • Figure 2: An illustration of MC-SOT based SLM post-training.