Target Speech Diarization with Multimodal Prompts

Yidi Jiang, Ruijie Tao, Zhengyang Chen, Yanmin Qian, Haizhou Li

TL;DR

This work proposes a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts.

Abstract

Traditional speaker diarization seeks to detect "who spoke when" according to speaker characteristics. Extending to target speech diarization, we detect "when target event occurs" according to the semantic characteristics of speech. We propose a novel Multimodal Target Speech Diarization (MM-TSD) framework, which accommodates diverse and multi-modal prompts to specify target events in a flexible and user-friendly manner, including semantic language description, pre-enrolled speech, pre-registered face image, and audio-language logical prompts. We further propose a voice-face aligner module to project human voice and face representations into a shared space. We develop a multi-modal dataset based on VoxCeleb2 for MM-TSD training and evaluation. Additionally, we conduct comparative analysis and ablation studies for each category of prompts to validate the efficacy of each component in the proposed framework. Furthermore, our framework demonstrates versatility in performing various signal processing tasks, including speaker diarization and overlap speech detection, using task-specific prompts. As a unified system, MM-TSD achieves robust performance comparable to that of specialized models. Moreover, MM-TSD shows the capability to handle complex conversations in real-world data.

Paper Structure (45 sections, 2 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Illustration of the four types of prompts supported by our Multimodal Target Speech Diarization (MM-TSD) framework. The unified target speech diarization model can accommodate diverse multi-modal prompts, including a semantic language description, pre-enrolled speech or a pre-registered face of the target speaker, and an audio-text logical controller. Our framework then detects the activity regions of the target speech specified by the prompt.
  • Figure 2: Our MM-TSD framework takes an audio signal and a switchable multi-modal prompt as inputs, to output frame-wise binary classification of the prompt-specified speech event. It accommodates diverse prompt types such as semantic language descriptions, pre-enrolled speech, pre-registered face images or the combination of audio-text logical prompts to specify the target event. This framework comprises a speech encoder, three modality-specific prompt encoders and a Transformer encoder-decoder structure. Then a dot product $\bigotimes$ is applied between the encoder and decoder outputs, followed by a sigmoid operation $\sigma$ to calculate the target event occurrence probability at each frame.
  • Figure 3: Text prompt encoding. The textual prompt is first processed by a tokenizer to generate word tokens, including a "[CLS]" token at the beginning. We utilize a pre-trained DistilBERT encoder with Low-Rank Adaptation (LoRA) to derive the sentence embedding. The feature of the "[CLS]" token is then used as the prompt embedding $E$.
  • Figure 4: Voice-face alignment involves inputs from speech segments and face images belonging to the same individual, which are denoted within dashed boxes. We utilize pre-trained ECAPA-TDNN as the speaker encoder and ResNet50 as the face encoder to extract respective embeddings from the speech segment and face image. Following this, a voice-face aligner is employed to match face identity with voice characteristics in a shared embedding space. During the aligner training phase, the voice-face aligner is trained using Mean Squared Error (MSE) loss. In the subsequent MM-TSD training phase, the visual prompt encoder and voice-face aligner are both frozen to derive the visual prompt embedding $E$.
  • Figure 5: The encoder receives the speech embedding $F^a$, which is extracted from the speech encoder, and produces a frame-level speech representation $F^e$. The decoder utilizes the prompt embedding $E$ as the query within a cross-attention mechanism, with $F^e$ serving as both the key and value. This setup enables precise alignment and interaction between the speech embedding $F^a$ and prompt embedding $E$ to detect the prompt-specified target event activities within the speech signal. $\bigotimes$ denotes the dot product operation between transformer encoder and decoder outputs.
  • ...and 2 more figures
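
The scoring step described in the captions of Figures 2 and 5 (a dot product between the encoder and decoder outputs, followed by a sigmoid) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the decoder yields a single vector for the single prompt query $E$, that encoder frames and the decoder vector share one dimension, and that a hypothetical threshold of 0.5 turns probabilities into binary frame labels. The function names `frame_probabilities` and `binarize` are made up for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def frame_probabilities(
    encoder_frames: List[List[float]],  # stands in for F^e, one vector per frame
    decoder_vector: List[float],        # assumed single decoder output for prompt E
) -> List[float]:
    """Dot product of each frame with the decoder output, then sigmoid."""
    probs = []
    for frame in encoder_frames:
        if len(frame) != len(decoder_vector):
            raise ValueError("frame and decoder vector must have the same size")
        score = sum(f * d for f, d in zip(frame, decoder_vector))
        probs.append(sigmoid(score))
    return probs


def binarize(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Hypothetical decision rule: 1 where the target event is judged active."""
    return [1 if p >= threshold else 0 for p in probs]


# Toy input: three frames of a 2-dimensional representation.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
decoder_out = [2.0, -2.0]
probs = frame_probabilities(frames, decoder_out)
labels = binarize(probs)
```

With this toy input the per-frame scores are 2, -2 and 4, so the first and third frames are marked active and the second is not. In a real system the vectors would come from the Transformer encoder and decoder, and the threshold would be tuned on held-out data.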