Steering Deep Non-Linear Spatially Selective Filters for Weakly Guided Extraction of Moving Speakers in Dynamic Scenarios

Jakob Kienegger, Timo Gerkmann

TL;DR

This work tackles moving-speaker extraction when precise time-varying directional cues are unavailable. It introduces a weakly guided target speaker extraction (TSE) pipeline that requires only the initial target direction $\theta_0$: a deep tracking module estimates the time-evolving direction $\theta_t$, which steers a spatially selective filter that estimates the target mask $\mathcal{M}_{tk}$. With joint training on a synthetic dataset with continuous motion, the approach resolves spatial ambiguities and even outperforms a mismatched, strongly guided baseline, while remaining robust to inaccurate tracking. The results indicate improved perceptual quality and intelligibility metrics in dynamic scenarios, suggesting that reliance on precise time-dependent directional cues can be reduced in practice.

Abstract

Recent speaker extraction methods using deep non-linear spatial filtering perform exceptionally well when the target direction is known and stationary. However, spatially dynamic scenarios are considerably more challenging due to time-varying spatial features and the ambiguities that arise, e.g., when moving speakers cross. While in a static scenario it may be easy for a user to point to the target's direction, manually tracking a moving speaker is impractical. Instead of relying on accurate time-dependent directional cues, which we refer to as strong guidance, in this paper we propose a weakly guided extraction method that depends solely on the target's initial position to cope with spatially dynamic scenarios. By incorporating our own deep tracking algorithm and developing a joint training strategy on a synthetic dataset, we demonstrate the proficiency of our approach in resolving spatial ambiguities and even outperform a mismatched, but strongly guided extraction method.
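
The data flow of the weakly guided pipeline summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the single reference channel, and the element-wise mask application $\widehat{S}_{tk} = \mathcal{M}_{tk} Y_{tk}$ are assumptions made for illustration, and the two stand-in callables are hypothetical placeholders for the neural tracking and filtering modules.

```python
from typing import Callable, List, Sequence, Tuple

# STFT bins Y_tk of one frame t. Assumption: one reference channel only, for
# brevity; a real system would pass all microphone channels to both modules.
Frame = List[complex]


def weakly_guided_tse(
    frames: Sequence[Frame],
    theta_0: float,
    tracker: Callable[[Sequence[Frame], float], List[float]],
    ssf: Callable[[Frame, float], List[float]],
) -> Tuple[List[float], List[Frame]]:
    """Estimate theta_t from theta_0, steer the filter, and apply its mask."""
    # Tracking stage: theta_0 is the only directional cue supplied by the user.
    thetas = tracker(frames, theta_0)
    estimate = []
    for y_t, theta_t in zip(frames, thetas):
        # Mask M_tk from the spatially selective filter steered towards theta_t.
        mask_t = ssf(y_t, theta_t)
        # Assumed extraction step: S_hat_tk = M_tk * Y_tk.
        estimate.append([m * y for m, y in zip(mask_t, y_t)])
    return thetas, estimate


# Hypothetical stand-ins so the sketch runs end to end.
def hold_initial_direction(frames: Sequence[Frame], theta_0: float) -> List[float]:
    # "No tracking": keeps theta_t = theta_0 for every frame.
    return [theta_0 for _ in frames]


def pass_through_mask(y_t: Frame, theta_t: float) -> List[float]:
    # All-pass mask that ignores the steering direction.
    return [1.0 for _ in y_t]


frames = [[1 + 1j, 0.5j], [2 + 0j, -1j]]
thetas, s_hat = weakly_guided_tse(frames, 30.0, hold_initial_direction, pass_through_mask)
```

Keeping the tracker and the filter behind separate callables mirrors the two-stage structure described in the abstract, so either stage can be swapped out, for example by supplying ground-truth directions in place of the tracker to emulate strong guidance.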

Paper Structure

This paper contains 13 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Proposed weakly guided target speaker extraction (TSE) pipeline. In contrast to strongly guided TSE, our method only requires the initial DoA $\theta_0$ to extract the speech signal $\widehat{S}_{tk}$.
  • Figure 2: Proposed deep TST conditioned on the initial DoA $\theta_0$. The network architecture is based on tesch24ssf_journal and bohlender21ssl_temporal_context. Colored layers indicate learnable parameters.
  • Figure 3: DoA estimation with the TST methods presented in \ref{sec:nn_training}. Selected trajectories correspond to \ref{eq:motion_model} with an expected displacement $\mathbb{E}\left\{ |\Delta \theta_t| \right\}$ of $\frac{180^\circ}{5\,\mathrm{s}}$.
  • Figure 4: Selected TSE configurations evaluated with a metric sensitive to distortions. Shaded areas indicate the standard deviation.
  • Figure 5: Accuracy (ACC) and median angular error (AE) of TST algorithms. Shaded areas indicate the 25% and 75% quartiles.
  • ...and 1 more figure
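
The tracking metrics named in the Figure 5 caption can be illustrated with a short Python sketch. The exact definitions used in the paper are not given in this summary, so the following are assumptions: directions are azimuth angles in degrees on a full circle, the angular error is the wrapped absolute difference, and accuracy is the fraction of frames whose error stays within a hypothetical tolerance `tolerance_deg`.

```python
from statistics import median
from typing import Sequence, Tuple


def angular_error(theta_est: float, theta_ref: float) -> float:
    """Absolute angular difference in degrees, wrapped to [0, 180]."""
    diff = (theta_est - theta_ref) % 360.0
    return min(diff, 360.0 - diff)


def tracking_metrics(
    est: Sequence[float],
    ref: Sequence[float],
    tolerance_deg: float = 10.0,  # hypothetical threshold, not taken from the paper
) -> Tuple[float, float]:
    """Return (accuracy, median angular error) over all frames."""
    errors = [angular_error(e, r) for e, r in zip(est, ref)]
    # Assumed accuracy definition: share of frames with error within the tolerance.
    accuracy = sum(err <= tolerance_deg for err in errors) / len(errors)
    return accuracy, median(errors)


est = [350.0, 10.0, 95.0, 180.0]
ref = [10.0, 5.0, 90.0, 200.0]
acc, med_ae = tracking_metrics(est, ref)
print(acc, med_ae)  # 0.5 12.5
```

The wrap-around matters for trajectories that cross the 0°/360° boundary: an estimate of 350° against a reference of 10° is 20° off, not 340°.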