Autoregressive Guidance of Deep Spatially Selective Filters using Bayesian Tracking for Efficient Extraction of Moving Speakers

Jakob Kienegger; Timo Gerkmann

Abstract

Deep spatially selective filters achieve high-quality enhancement with real-time capable architectures for stationary speakers of known directions. To retain this level of performance in dynamic scenarios when only the speakers' initial directions are given, accurate yet computationally lightweight tracking algorithms become necessary. Assuming frame-wise causal processing, temporal feedback allows the enhanced speech signal to be leveraged to improve tracking performance. In this work, we investigate strategies to incorporate the enhanced signal into lightweight tracking algorithms and autoregressively guide deep spatial filters. Our proposed Bayesian tracking algorithms are compatible with arbitrary deep spatial filters. To increase the realism of simulated trajectories during development and evaluation, we propose and publish a novel dataset based on the social force model. Results validate that the autoregressive incorporation significantly improves the accuracy of our Bayesian trackers, resulting in superior enhancement with no or only negligible additional computational overhead. Real-world recordings complement these findings and demonstrate the generalizability of our methods to unseen, challenging acoustic conditions.
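To make the feedback idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a scalar random-walk Kalman filter on the azimuth whose innovation is wrapped to the circle (a simplification of a full wrapped Kalman filter), and two hypothetical callables: `observe(frame, enhanced_prev)`, which returns a noisy direction estimate and may use the previously enhanced frame, and `spatial_filter(frame, theta)`, which stands in for an arbitrary deep spatial filter. The noise variances `q` and `r` are illustrative values only.

```python
import math


def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def kf_step(theta, var, obs, q=1e-3, r=1e-1):
    """One predict/update step of a scalar random-walk Kalman filter.

    Only the innovation is wrapped, which is a simplification chosen
    for this sketch; q and r are assumed process/observation variances.
    """
    var_pred = var + q                      # predict: uncertainty grows
    innovation = wrap(obs - theta)          # shortest angular difference
    gain = var_pred / (var_pred + r)        # Kalman gain
    theta_new = wrap(theta + gain * innovation)
    var_new = (1.0 - gain) * var_pred
    return theta_new, var_new


def autoregressive_loop(frames, theta0, spatial_filter, observe, var0=1e-2):
    """Frame-wise causal loop: track, steer the filter, feed its output back.

    `spatial_filter` and `observe` are hypothetical placeholders.
    """
    theta, var = theta0, var0
    enhanced_prev = None                    # no enhanced signal before frame 0
    directions, outputs = [], []
    for frame in frames:
        obs = observe(frame, enhanced_prev)         # uses temporal feedback
        theta, var = kf_step(theta, var, obs)       # update direction estimate
        enhanced_prev = spatial_filter(frame, theta)  # steer filter to theta
        directions.append(theta)
        outputs.append(enhanced_prev)
    return directions, outputs
```

In this sketch the only coupling between tracker and filter is the pair of callables, which mirrors the claim that the tracking side can be combined with arbitrary spatial filters.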

Paper Structure (20 sections, 41 equations, 8 figures, 1 table, 1 algorithm)

Introduction
Problem Definition
Steering Spatial Filters
Strongly Guided Target Speaker Extraction
Bayesian Tracking for Weakly Guided Speaker Extraction
Autoregressively Guided Spatial Filters
Extended Bayesian Filtering Formulations
Multiple-Input and Multiple-Output (MIMO) Spatial Filters
Dataset
Acoustic Dataset Parametrization
Social Force Motion Model
Experimental Setup
Model and Algorithm Parametrization
Training and Optimization Details
Evaluation
...and 5 more sections

Figures (8)

Figure 1: Weakly guided speaker extraction using a tracking algorithm (TST) to estimate the target's direction $\theta_t$ from the starting direction $\theta_0$ and guide a spatially selective filter (SSF) for enhancement. We propose an autoregressive (AR) integration of the processed speech for improved guidance in (b) and (c).
Figure 2: Social force motion model adapted from Helbing and Molnár (1995) to simulate planar two-speaker trajectories in an enclosed room. An underlying Newtonian formulation enforces smooth motion patterns while satisfying boundary constraints. Dataset generation code and further visualizations are available online.
Figure 3: Sample two-speaker trajectories using the Wrapped KF and Bootstrap PF for tracking in (top to bottom) the concatenative (Figure 1a) and our autoregressive (MISO-AR: Figure 1b, MIMO-AR: Figure 1c) configurations.
Figure 4: Closed-loop (AR) parameter optimization on the validation set using the Wrapped KF for TST and SpatialNet-MIMO as SSF (MIMO-AR, Figure 1c).
Figure 5: Computational cost (MACs) and tracking performance of our Bayesian filters (Wrapped KF, Bootstrap PF) relative to DNN-based trackers (SELDnet, CNN/LSTM).
...and 3 more figures
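
The social force motion model named in the Figure 2 caption can be illustrated with a minimal sketch. This is not the published dataset generator: the force terms (a driving force toward a goal, exponential repulsion from other speakers and from the walls), the parameter values `speed`, `tau`, `a`, `b`, and the final clamping to the room are all assumptions chosen for illustration.

```python
import math


def social_force_step(pos, vel, goal, others, room, dt=0.1,
                      speed=1.0, tau=0.5, a=2.0, b=0.3):
    """Advance one speaker by one Euler step of a simple social force model.

    pos, vel, goal: (x, y) tuples; others: list of (x, y) positions;
    room: (width, height) with walls at x=0, x=width, y=0, y=height.
    All parameter values are illustrative assumptions.
    """
    # Driving force: relax toward the desired velocity pointing at the goal,
    # slowing down within one metre of the goal to avoid overshooting.
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    dist = math.hypot(dx, dy) or 1e-9
    desired = speed * min(1.0, dist)
    fx = (desired * dx / dist - vel[0]) / tau
    fy = (desired * dy / dist - vel[1]) / tau

    # Repulsion from other speakers, decaying exponentially with distance.
    for ox, oy in others:
        rx, ry = pos[0] - ox, pos[1] - oy
        d = math.hypot(rx, ry) or 1e-9
        mag = a * math.exp(-d / b)
        fx += mag * rx / d
        fy += mag * ry / d

    # Repulsion from the four walls of the rectangular room.
    w, h = room
    fx += a * math.exp(-pos[0] / b) - a * math.exp(-(w - pos[0]) / b)
    fy += a * math.exp(-pos[1] / b) - a * math.exp(-(h - pos[1]) / b)

    # Newtonian update (unit mass): force -> velocity -> position.
    vel = (vel[0] + dt * fx, vel[1] + dt * fy)
    x = min(max(pos[0] + dt * vel[0], 0.0), w)  # clamp as a hard safeguard
    y = min(max(pos[1] + dt * vel[1], 0.0), h)
    return (x, y), vel
```

Because the position is obtained by integrating a velocity that is itself integrated from forces, the resulting paths are smooth, which is the property the caption attributes to the Newtonian formulation.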
