pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

Ziyang Jiang, Xinyuan Qian, Jiahe Lei, Zexu Pan, Wei Xue, Xu-cheng Yin

TL;DR

Experiments show that semantic cues derived from limited, unaligned text are effective for identifying the target speaker, yielding an SI-SDRi of 12.16 dB, an SDRi of 12.66 dB, a PESQi of 0.830, and an STOIi of 0.150.

Abstract

Target Speaker Extraction (TSE) aims to extract the clean speech of the target speaker from an audio mixture, eliminating irrelevant background noise and interfering speech. Prior work has explored various auxiliary cues, including pre-recorded speech, visual information (e.g., lip motions and gestures), and spatial information, but acquiring and selecting such strong cues is infeasible in many practical scenarios. Unlike all existing work, in this paper we condition the TSE algorithm on semantic cues extracted from limited and unaligned text content, such as condensed points from a presentation slide. This method is particularly useful in scenarios like meetings, poster sessions, or lecture presentations, where acquiring other cues in real time is challenging. To this end, we design two different networks. Our proposed TPE fuses audio features with content-based semantic cues to guide time-frequency mask generation, which filters out extraneous noise. Our second proposal, TSR, employs contrastive learning to associate blindly separated speech signals with semantic cues. The experimental results show that semantic cues derived from limited and unaligned text are effective for accurately identifying the target speaker, yielding an SI-SDRi of 12.16 dB, an SDRi of 12.66 dB, a PESQi of 0.830, and an STOIi of 0.150. Dataset and source code will be publicly available. Project demo page: https://slideTSE.github.io/.
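The headline metric, SI-SDRi, is the gain in scale-invariant signal-to-distortion ratio of the extracted speech over the unprocessed mixture. The sketch below uses the commonly cited SI-SDR definition (zero-mean signals, projection of the estimate onto the target). It is not taken from the authors' evaluation code, and the function names `si_sdr` and `si_sdr_improvement` and the toy sinusoid signals are illustrative assumptions.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals (lists of floats)."""
    n = len(target)
    # Remove the mean of each signal, as in the usual SI-SDR definition.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target to find the optimally scaled target.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_energy = sum(s * s for s in scaled_target)
    residual_energy = sum(r * r for r in residual) + eps
    return 10.0 * math.log10(signal_energy / residual_energy + eps)


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: SI-SDR of the estimate minus SI-SDR of the raw mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Toy signals (assumption): two orthogonal sinusoids stand in for the
    # target speech and the interfering speech.
    n = 400
    target = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
    interference = [math.sin(2 * math.pi * 13 * k / n) for k in range(n)]
    mixture = [t + i for t, i in zip(target, interference)]
    # A hypothetical extractor that suppresses the interference by a factor of 10.
    estimate = [t + 0.1 * i for t, i in zip(target, interference)]
    print(f"SI-SDRi: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

In this toy case the interference amplitude drops by a factor of ten, so the improvement is about 20 dB. SDRi, PESQi, and STOIi follow the same "metric after minus metric before" pattern with their respective base metrics.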

Paper Structure

This paper contains 19 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of our proposed pTSE-T task which extracts the presenter's speech from the audio mixture (including interfering speaker's speech and background noise) using unaligned text from the visual presentation slide.
  • Figure 2: The structure of our proposed TPE network. The input consists of the mixed speech waveform $x(\tau)$ and pre-processed text prompts $p_t$, which are processed separately by a speech encoder and a text encoder to obtain $X(t)$ and $P_{text}$. Subsequently, within the fusion layer, we apply $\text{FiLM}(\cdot)$ for hierarchical fusion of features from both modalities to obtain $F(t)$. The fused features are then processed by a mask estimator to generate an estimated mask $M(t)$, which is element-wise multiplied with $X(t)$ and passed through the speech decoder to obtain the predicted speech waveform $\hat{s}(\tau)$ of the target speaker.
  • Figure 3: The structure of our proposed TSR. The waveforms $speech_i$ are separated from the mixed speech using DPRNN, and the $text_i$ are their corresponding text prompts. These inputs are passed through encoders and adapters to extract latent features $X_s$ and $X_t$, respectively. Next, the residual attention matrices $M_s$ and $M_t$ are generated via a cross-attention mechanism. During training, clean speech waveforms are used as inputs, whereas for inference, the inputs are the separation results from DPRNN.
  • Figure 4: The data processing pipeline for constructing our innovative MMSpeech dataset. The dashed rectangle marks the input to our TSE network.
  • Figure 5: Comparison between our proposed TPE and DPRNN-TSR in terms of (a) SI-SDRi histogram; (b) average SI-SDRi against the interference SDR; (c) average accuracy against the interference SDR.
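
The fusion and masking steps described in the Figure 2 caption can be sketched as follows. This is a minimal, single-layer illustration of feature-wise linear modulation (FiLM) followed by masking, not the authors' implementation: the paper applies FiLM hierarchically and uses a learned mask estimator, whereas here a plain sigmoid stands in for the mask estimator. All weights, dimensions, and the function names `linear`, `film`, and `apply_mask` are hypothetical.

```python
import math
import random


def linear(vec, weight, bias):
    """Dense layer: weight has shape (out_dim x in_dim), vec has in_dim entries."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features, text_emb, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift each feature channel using text-derived gamma and beta.

    features is a list of T frames, each with C channels, playing the role of X(t).
    text_emb plays the role of P_text. The result plays the role of F(t).
    """
    gamma = linear(text_emb, w_gamma, b_gamma)  # one scale per channel
    beta = linear(text_emb, w_beta, b_beta)     # one shift per channel
    return [[g * x + b for g, x, b in zip(gamma, frame, beta)] for frame in features]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def apply_mask(features, fused):
    """Stand-in mask estimator (assumption): M(t) = sigmoid(F(t)), then M(t) * X(t)."""
    mask = [[sigmoid(f) for f in frame] for frame in fused]
    masked = [[m * x for m, x in zip(m_row, x_row)] for m_row, x_row in zip(mask, features)]
    return mask, masked


if __name__ == "__main__":
    random.seed(0)
    frames, channels, text_dim = 4, 3, 5  # toy sizes, chosen arbitrarily
    x = [[random.gauss(0, 1) for _ in range(channels)] for _ in range(frames)]
    p_text = [random.gauss(0, 1) for _ in range(text_dim)]
    w_g = [[random.gauss(0, 0.1) for _ in range(text_dim)] for _ in range(channels)]
    w_b = [[random.gauss(0, 0.1) for _ in range(text_dim)] for _ in range(channels)]
    fused = film(x, p_text, w_g, [1.0] * channels, w_b, [0.0] * channels)
    mask, masked = apply_mask(x, fused)
    print(len(masked), "frames x", len(masked[0]), "channels after masking")
```

In a full system the masked features would be passed to the speech decoder to produce the waveform $\hat{s}(\tau)$. The sketch stops at the masked representation because the encoder and decoder are not specified in this excerpt.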