Inter-Speaker Relative Cues for Two-Stage Text-Guided Target Speech Extraction

Wang Dai, Archontis Politis, Tuomas Virtanen

TL;DR

Experimental results demonstrate that certain relative cues can surpass the performance of an audio-based TSE system, and that the proposed two-stage TSE framework substantially outperforms single-stage text-conditioned extraction methods on both signal-level and objective perceptual metrics.

Abstract

This paper investigates the use of relative cues for text-based target speech extraction (TSE). We first provide a theoretical justification for relative cues from the perspectives of human perception and label quantization, showing that relative cues preserve fine-grained distinctions often lost in absolute categorical representations. Building on this analysis, we propose a two-stage TSE framework, in which a speech separation model generates candidate sources, followed by a text-guided classifier that selects the target speaker based on embedding similarity. Using this framework, we train two separate classification models to evaluate the advantages of relative cues over independent cues in terms of both classification accuracy and TSE performance. Experimental results demonstrate that (i) relative cues achieve higher overall classification accuracy and improved TSE performance compared with independent cues, (ii) the two-stage framework substantially outperforms single-stage text-conditioned extraction methods on both signal-level and objective perceptual metrics, and (iii) certain relative cues (language, gender, loudness, distance, temporal order, speaking duration, random cue and all cue) can surpass the performance of an audio-based TSE system. Further analysis reveals notable differences in discriminative power across cue types, providing insights into the effectiveness of different relative cues for TSE.
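
The selection step of the two-stage framework can be illustrated with a minimal sketch. It assumes that the separation stage has already produced one embedding per candidate source and that the text query has been embedded into the same space. The abstract says only that selection is "based on embedding similarity", so the use of cosine similarity, and the names `cosine_similarity` and `select_target`, are hypothetical choices made for illustration and are not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_target(candidate_embeddings, text_embedding):
    """Pick the candidate source whose embedding best matches the text query.

    Assumption: cosine similarity stands in for the unspecified
    "embedding similarity" mentioned in the abstract.
    Returns (index of the selected candidate, list of similarity scores).
    """
    sims = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


# Toy example: two separated candidates and one text-query embedding.
candidates = [[0.9, 0.1], [0.2, 0.8]]
query = [0.1, 1.0]
index, scores = select_target(candidates, query)
print(index)  # index of the candidate treated as the target speaker
```

In this sketch the second stage never modifies the audio: it only chooses among the outputs of the separation model, so the extraction quality is bounded by the quality of the separated candidates.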

Paper Structure

This paper contains 18 sections, 15 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Illustration of quantization schemes for continuous-valued attributes. (a) Independent cues: the attribute value space of the target and interfering speakers, denoted by $x_i^{(\mathit{tar})}$ (y-axis) and $x_i^{(\mathit{inf})}$ (x-axis), is partitioned into a small number of coarse categories (three in this example). Shaded regions indicate attribute values of the two speakers that are quantized into the same category, in which case the two speakers cannot be distinguished based on that attribute. (b) Relative cues with absolute differences: the attribute values of the target and interfering speakers are compared using their absolute difference. The shaded Similar region corresponds to $|\Delta x_i^{(*)}| \le \theta_i$, indicating differences below the discrimination threshold. Regions outside the shaded area represent perceptually distinct absolute differences. (c) Relative cues with percentage differences: the attribute values of the target and interfering speakers are compared using their percentage difference. The shaded Similar region corresponds to $|\Delta x_i^{(*)}| \le \theta_i$, indicating differences below the discrimination threshold. Because the comparison is based on relative differences, the Similar region widens gradually with distance from the origin. Regions outside the shaded area represent perceptually distinct relative differences.
  • Figure 2: Overview of the proposed two-stage text-guided TSE system during inference.
  • Figure 3: Comparison of training and validation loss over epochs for the TF-Locoformer-based single-stage TSE model and the TF-Locoformer-based separation model. Lower (more negative) loss is better.
  • Figure 4: Overall trend of classification accuracy with respect to the attribute differences between target and interfering speakers for each cue type.
  • Figure 5: Classification accuracy and corresponding sample distribution across regions for independent (a) and relative (b) speaking rate cues, where $x^{(\mathit{tar})}$ (y-axis) and $x^{(\mathit{inf})}$ (x-axis) represent the speaking rate values of the target and interfering speakers, respectively.
  • ...and 2 more figures
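
The three quantization schemes described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's labeling code: the function names, the category boundaries, the threshold values, the "higher"/"lower" label names, and the definition of the percentage difference as $(x^{(\mathit{tar})} - x^{(\mathit{inf})}) / x^{(\mathit{inf})}$ are all hypothetical. Only the Similar region, defined by $|\Delta x| \le \theta$, comes from the caption.

```python
def independent_label(value, boundaries):
    """Independent cue: quantize one speaker's attribute into a coarse category.

    `boundaries` is an ascending list of hypothetical category edges;
    the result is the index of the category that `value` falls into.
    """
    return sum(value >= b for b in boundaries)


def relative_label_absolute(x_tar, x_inf, theta):
    """Relative cue based on the absolute difference between two speakers."""
    delta = x_tar - x_inf
    if abs(delta) <= theta:
        return "similar"  # below the discrimination threshold
    return "higher" if delta > 0 else "lower"


def relative_label_percentage(x_tar, x_inf, theta):
    """Relative cue based on the percentage difference between two speakers.

    Assumption: the difference is normalized by the interfering speaker's
    value, which must be non-zero.
    """
    delta = (x_tar - x_inf) / x_inf
    if abs(delta) <= theta:
        return "similar"
    return "higher" if delta > 0 else "lower"


# Two speakers whose attribute values fall into the same coarse category.
x_tar, x_inf = 4.1, 4.6
edges = [3.0, 5.0]  # hypothetical edges giving three categories

same_category = independent_label(x_tar, edges) == independent_label(x_inf, edges)
print(same_category)                                # independent cue cannot separate them
print(relative_label_absolute(x_tar, x_inf, 0.2))   # relative cue with absolute threshold
print(relative_label_percentage(x_tar, x_inf, 0.05))  # relative cue with percentage threshold
```

With these toy values both speakers receive the same independent label, which corresponds to a shaded region in panel (a), while either relative scheme still distinguishes them as long as the difference exceeds the chosen threshold.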