Investigation of Speaker Representation for Target-Speaker Speech Processing

Takanori Ashihara, Takafumi Moriya, Shota Horiguchi, Junyi Peng, Tsubasa Ochiai, Marc Delcroix, Kohei Matsuura, Hiroshi Sato

TL;DR

This paper asks a fundamental question: which speaker embedding is preferable for target-speaker speech processing (TS) tasks? It compares pre-trained speaker encoders, which compute speaker embeddings from pre-recorded enrollment speech of the target speaker, against ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.

Abstract

Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker's speech even when it is corrupted by interfering speakers. While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network for embedding target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. Therefore, this paper aims to address a fundamental question: what is the preferred speaker embedding for TS tasks? To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders (i.e., self-supervised or speaker recognition models), which compute speaker embeddings from pre-recorded enrollment speech of the target speaker, with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector. To further understand the properties of the ideal speaker embedding, we optimize it using a gradient-based approach to improve performance on the TS task. Our analysis reveals that speaker verification performance is somewhat unrelated to TS task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
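As a rough illustration of the one-hot speaker embedding mentioned in the abstract, the sketch below assumes the one-hot vector is mapped to a dense embedding by a linear projection. That projection, the helper names (one_hot, project), and the toy sizes are assumptions made for illustration and are not taken from the paper. Under this assumption, the projection simply selects the row of the weight matrix that belongs to the target speaker, so the embedding is tied to a fixed, known speaker set and needs no enrollment speech.

```python
# Hypothetical sketch: a one-hot speaker identity fed through a linear layer.
# All names and sizes are illustrative assumptions, not from the paper.


def one_hot(speaker_index, num_speakers):
    """Return a one-hot list identifying one speaker among num_speakers."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError("speaker_index out of range")
    return [1.0 if i == speaker_index else 0.0 for i in range(num_speakers)]


def project(vector, weight):
    """Multiply a row vector (length N) by a weight matrix (N x D)."""
    dim = len(weight[0])
    return [
        sum(vector[i] * weight[i][d] for i in range(len(vector)))
        for d in range(dim)
    ]


# Toy projection table: 3 known speakers, 2-dimensional embeddings.
WEIGHT = [
    [0.5, -1.0],
    [2.0, 0.25],
    [-0.75, 1.5],
]

# The projected one-hot vector equals the weight row of speaker 1.
embedding = project(one_hot(1, 3), WEIGHT)
print(embedding)
```

An enrollment-based system would instead compute the embedding by running a speaker encoder on recorded speech, which is what lets it handle speakers outside the training set.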

Paper Structure

This paper contains 15 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Schematic diagrams of the SUPERB-based TS evaluation system.
  • Figure 2: Schematic diagram of gradient-based speaker embedding optimization. The forwarding process of an auxiliary network is executed only once to obtain an initial speaker embedding to be optimized.
  • Figure 3: Visualization of speaker embeddings. Each color represents one speaker. Blue square and red cross markers indicate male and female speakers, respectively. DINO and ECAPA-TDNN denote the ECAPA-TDNN-DINO and ECAPA-TDNN-c1024 models, respectively.
  • Figure 4: Gradient-based optimization results. (a) Evaluation results on TS-ASR using the optimized embeddings at each number of iterations. (b) Visualization of the speaker embeddings at each iteration for each speaker on the TS-ASR task. Each color represents one speaker. Colors fade from darker to lighter as the number of iterations increases. Different markers represent different mixture conditions (m-f, m-m, and f-f denote male-female, male-male, and female-female mixtures, respectively).
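The gradient-based speaker embedding optimization described in the Figure 2 caption can be sketched with a toy example. The code below is a minimal sketch under stated assumptions: the frozen model (ts_model), the stand-in encoder (auxiliary_network), the mean squared error loss against a reference signal, the learning rate, and all numeric values are hypothetical and are not the paper's setup. It only shows the general pattern: the auxiliary network runs once to produce an initial embedding, and afterwards only the embedding is updated by gradient descent while the model weights stay fixed.

```python
# Toy sketch of gradient-based speaker embedding optimization.
# The model, sizes, learning rate, and loss are hypothetical stand-ins;
# only the embedding is updated while the model stays frozen.

FROZEN_W = [1.0, -0.5, 2.0]  # frozen weights of the toy TS model


def auxiliary_network():
    """Stand-in for a speaker encoder; called once for the initial embedding."""
    return [0.2, 0.1, -0.3]


def ts_model(mixture, embedding):
    """Toy frozen model: scale the mixture by a gain derived from the embedding."""
    gain = sum(w * e for w, e in zip(FROZEN_W, embedding))
    return [gain * m for m in mixture]


def loss_and_grad(mixture, reference, embedding):
    """Mean squared error and its analytic gradient w.r.t. the embedding."""
    output = ts_model(mixture, embedding)
    n = len(mixture)
    loss = sum((o - r) ** 2 for o, r in zip(output, reference)) / n
    # d(loss)/d(gain), then chain rule through gain = FROZEN_W . embedding
    dgain = sum(
        2.0 * (o - r) * m for o, r, m in zip(output, reference, mixture)
    ) / n
    return loss, [dgain * w for w in FROZEN_W]


def optimize_embedding(mixture, reference, steps=20, lr=0.02):
    """Run gradient descent on the embedding only; return it with the loss history."""
    embedding = auxiliary_network()  # forward pass executed only once
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(mixture, reference, embedding)
        history.append(loss)
        embedding = [e - lr * g for e, g in zip(embedding, grad)]
    final_loss, _ = loss_and_grad(mixture, reference, embedding)
    history.append(final_loss)
    return embedding, history


mixture = [1.0, -2.0, 0.5, 1.5]
reference = [0.5 * m for m in mixture]  # toy target: mixture scaled by 0.5
optimized, losses = optimize_embedding(mixture, reference)
print(losses[0], losses[-1])
```

Because the loss here depends on the particular mixture passed in, a different mixture would generally lead to a different optimized embedding, which is in the spirit of the abstract's observation that the optimal embedding depends on the input mixture.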