Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers

Jakob Kienegger, Timo Gerkmann

Abstract

Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios when only the speakers' initial directions are given, accurate, yet computationally lightweight tracking algorithms become necessary. Assuming a frame-wise causal processing style, temporal feedback allows for leveraging the enhanced speech signal to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model. Results validate that the autoregressive incorporation significantly improves the accuracy of our Bayesian trackers, resulting in superior enhancement with no or only negligibly increased computational overhead. Real-world recordings complement these findings and demonstrate the generalizability of our methods to unseen, challenging acoustic conditions.

Paper Structure (20 sections, 41 equations, 8 figures, 1 table, 1 algorithm)

Figures (8)

  • Figure 1: Weakly guided speaker extraction using a tracking algorithm (TST) to estimate the target’s direction $\theta_t$ from the starting direction $\theta_0$ and guide a spatially selective filter (SSF) for enhancement. We propose an autoregressive (AR) integration of the processed speech for improved guidance in (b) and (c).
  • Figure 2: Social force motion model adapted from Helbing and Molnár (1995) to simulate planar two-speaker trajectories in an enclosed room. An underlying Newtonian formulation enforces smooth motion patterns while satisfying boundary constraints. Dataset generation code and further visualizations are available online.
  • Figure 3: Sample two-speaker trajectories using the Wrapped KF and the Bootstrap PF for tracking in (top to bottom) the concatenative and our autoregressive (MISO-AR, MIMO-AR) configurations shown in Figure 1.
  • Figure 4: Closed-loop (AR) parameter optimization on the validation set using the Wrapped KF for TST and SpatialNet-MIMO as SSF (MIMO-AR configuration, see Figure 1).
  • Figure 5: Computational cost (MACs) and tracking performance of our Bayesian filters (Wrapped KF, Bootstrap PF) relative to DNNs (SELDnet, CNN/LSTM).
  • ...and 3 more figures
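
The Figure 2 caption describes a Newtonian social force formulation that yields smooth trajectories while respecting room boundaries. The following is a minimal sketch of that idea, not the authors' dataset generator: it uses a simplified Helbing-style model with a goal-directed driving force plus exponential repulsion from other speakers and from the walls. All names and parameter values (`simulate_social_force`, `azimuth`, `tau`, `a_rep`, `b_rep`, the room size, the array position) are assumptions chosen for illustration.

```python
import numpy as np


def simulate_social_force(start, goal, room=(6.0, 5.0), steps=400, dt=0.05,
                          v_desired=1.0, tau=0.5, a_rep=2.0, b_rep=0.3):
    """Simulate planar speaker trajectories in a rectangular room.

    start, goal: arrays of shape (n_speakers, 2) in metres.
    Returns positions of shape (steps + 1, n_speakers, 2).
    All parameter values are illustrative assumptions.
    """
    pos = np.asarray(start, dtype=float).copy()
    goal = np.asarray(goal, dtype=float)
    room = np.asarray(room, dtype=float)
    vel = np.zeros_like(pos)
    traj = [pos.copy()]

    for _ in range(steps):
        # Driving force: relax the velocity toward a desired velocity that
        # points at the goal (speed is reduced within 1 m of the goal).
        to_goal = goal - pos
        dist = np.linalg.norm(to_goal, axis=1, keepdims=True)
        direction = to_goal / np.maximum(dist, 1e-9)
        speed = v_desired * np.minimum(1.0, dist)
        force = (speed * direction - vel) / tau

        # Pairwise repulsion between speakers, decaying exponentially.
        diff = pos[:, None, :] - pos[None, :, :]   # diff[i, j] = pos_i - pos_j
        d = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d, np.inf)                # no self-interaction
        d = np.maximum(d, 1e-9)
        weight = a_rep * np.exp(-d / b_rep) / d
        force += np.sum(weight[..., None] * diff, axis=1)

        # Wall repulsion: walls at 0 push in +x/+y, walls at `room` push back.
        d_low = np.maximum(pos, 1e-3)
        d_high = np.maximum(room - pos, 1e-3)
        force += a_rep * (np.exp(-d_low / b_rep) - np.exp(-d_high / b_rep))

        # Newtonian update (unit mass, semi-implicit Euler).
        vel = vel + dt * force
        pos = np.clip(pos + dt * vel, 0.0, room)   # hard safeguard at the walls
        traj.append(pos.copy())

    return np.stack(traj)


def azimuth(traj, mic=(3.0, 2.5)):
    """Azimuth (radians) of each speaker as seen from an assumed array position."""
    rel = traj - np.asarray(mic, dtype=float)
    return np.arctan2(rel[..., 1], rel[..., 0])


# Two speakers walking in opposite directions along parallel paths.
traj = simulate_social_force(start=[[1.0, 1.0], [5.0, 4.0]],
                             goal=[[5.0, 1.0], [1.0, 4.0]])
theta = azimuth(traj)  # shape (steps + 1, n_speakers)
```

Under these assumptions, `theta` plays the role of a ground-truth direction $\theta_t$ per frame, and its first entry corresponds to the starting direction $\theta_0$ from Figure 1. Because velocity changes only through forces, the resulting paths are smooth, which is the property the caption attributes to the Newtonian formulation.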