Audio Spotforming Using Nonnegative Tensor Factorization with Attractor-Based Regularization

Shoma Ayano; Li Li; Shogo Seki; Daichi Kitamura

TL;DR

This paper proposes a common-component extraction method based on nonnegative tensor factorization (NTF) that offers higher model interpretability and makes spotforming more robust to hyperparameter settings.

Abstract

Spotforming is a target-speaker extraction technique that uses multiple microphone arrays. Beamforming (BF) is applied to each microphone array, and the components common to the BF outputs are estimated as the target source. This study proposes a new common-component extraction method based on nonnegative tensor factorization (NTF) that offers higher model interpretability and makes spotforming more robust to hyperparameter settings. In addition, attractor-based regularization is introduced to facilitate the automatic selection of the optimal target bases in the NTF. Experimental results show that the proposed method outperforms conventional methods in spotforming and exhibits characteristics suitable for practical use.
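
To make the idea of extracting common components from several BF outputs more concrete, the following is a minimal sketch of a generic CP-style NTF, not the paper's algorithm. It assumes the BF-output spectrograms are stacked into a nonnegative tensor of shape (frequency, time, array) and fitted with multiplicative updates for the squared Euclidean cost. The function name `ntf_cp`, the factor names `W`, `H`, `Z`, the cost function, and the `balance` score are all illustrative assumptions; the paper's cost function and its attractor-based regularizer are not reproduced here.

```python
import numpy as np

def ntf_cp(X, K, n_iter=200, eps=1e-12, seed=0):
    """Fit X[f, t, m] ~ sum_k W[f, k] * H[k, t] * Z[m, k] with nonnegative factors.

    Generic multiplicative updates for the squared Euclidean cost
    (an assumption; not the cost or update rules of the paper).
    """
    rng = np.random.default_rng(seed)
    F, T, M = X.shape
    W = rng.random((F, K)) + eps  # spectral bases
    H = rng.random((K, T)) + eps  # temporal activations
    Z = rng.random((M, K)) + eps  # gain of each basis in each array's BF output

    def model():
        return np.einsum("fk,kt,mk->ftm", W, H, Z)

    costs = []
    for _ in range(n_iter):
        Xh = model()
        W *= np.einsum("ftm,kt,mk->fk", X, H, Z) / (np.einsum("ftm,kt,mk->fk", Xh, H, Z) + eps)
        Xh = model()
        H *= np.einsum("ftm,fk,mk->kt", X, W, Z) / (np.einsum("ftm,fk,mk->kt", Xh, W, Z) + eps)
        Xh = model()
        Z *= np.einsum("ftm,fk,kt->mk", X, W, H) / (np.einsum("ftm,fk,kt->mk", Xh, W, H) + eps)
        costs.append(float(np.sum((X - model()) ** 2)))
    return W, H, Z, costs

# Toy data: one component shared by both arrays plus one interferer per array.
rng = np.random.default_rng(1)
F, T = 16, 20
target = np.outer(rng.random(F), rng.random(T))
interf0 = np.outer(rng.random(F), rng.random(T))
interf1 = np.outer(rng.random(F), rng.random(T))
X = np.stack([target + interf0, target + interf1], axis=-1)  # shape (F, T, 2)

W, H, Z, costs = ntf_cp(X, K=3)

# Hypothetical heuristic: a basis whose gains are similar across arrays
# (ratio near 1) is a candidate "common" component.
balance = Z.min(axis=0) / (Z.max(axis=0) + 1e-12)
print("first / last cost:", costs[0], costs[-1])
print("balance per basis:", balance)
```

In this sketch the selection of target bases is left to a manual inspection of `balance`; according to the abstract, the paper instead uses attractor-based regularization to make that selection automatic.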

Paper Structure (12 sections, 1 theorem, 19 equations, 5 figures, 2 tables)

Introduction
Spotforming Using Multiple Microphone Arrays
  Scenario of Spotforming and Its Signal Model
  Conventional NMF-Based Spotforming
Proposed Method
  Motivations
  NTF-Based Spotforming
  Derivation of Update Rules
Experiment
  Conditions
  Results and Discussion
Conclusion

Key Result

Theorem 1

The update rules (eq:bk) and (eq:updZ)--(eq:updV) guarantee that the cost function in (eq:propCost) is monotonically nonincreasing.
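
Monotonicity guarantees of this kind are commonly established with a majorization-minimization (auxiliary function) argument. The sketch below states that generic argument, on the assumption that the paper's proof follows the same pattern; the symbols $\mathcal{J}$, $\mathcal{J}^{+}$, $\Theta$, and $\tilde{\Theta}$ are illustrative and are not taken from the paper.

```latex
% Assumption: an auxiliary function J^+ majorizes the cost J and touches it
% at the current estimate.
\mathcal{J}(\Theta) \le \mathcal{J}^{+}(\Theta, \tilde{\Theta})
\quad \text{for all } \Theta, \tilde{\Theta},
\qquad
\mathcal{J}(\Theta) = \mathcal{J}^{+}(\Theta, \Theta).

% Updating by minimizing the auxiliary function at the current estimate,
\Theta^{\mathrm{new}} = \arg\min_{\Theta} \mathcal{J}^{+}(\Theta, \Theta^{\mathrm{old}}),

% yields the chain of inequalities that gives monotonic nonincrease:
\mathcal{J}(\Theta^{\mathrm{new}})
\le \mathcal{J}^{+}(\Theta^{\mathrm{new}}, \Theta^{\mathrm{old}})
\le \mathcal{J}^{+}(\Theta^{\mathrm{old}}, \Theta^{\mathrm{old}})
= \mathcal{J}(\Theta^{\mathrm{old}}).
```

Under this argument the cost can stay constant between iterations, which is why the statement is nonincrease rather than strict decrease.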

Figures (5)

Figure 1: Situations and signals estimated by two BF filters.
Figure 2: Decomposition models of (a) NMF in the conventional method and (b) NTF in the proposed method.
Figure 3: Recording environments simulated by the two-dimensional image method: (a) two-microphone-array and (b) three-microphone-array cases. The microphone spacing in every array is 2.83 cm.
Figure 4: SDR scores with various $K$ in the two-microphone-array case: (a) $T_{60}=0$ ms and (b) $T_{60}=256$ ms. The plots and colored areas show the average values and standard deviations. Average SDRs of simple BF outputs $\bm{Y}^{(0)}$ and $\bm{Y}^{(1)}$ were 9.4 dB in (a) and 6.7 dB in (b).
Figure 5: SDR scores with various $K$ in the three-microphone-array case: (a) $T_{60}=0$ ms and (b) $T_{60}=256$ ms. The plots and colored areas show the average values and standard deviations. Average SDRs of simple BF outputs $\bm{Y}^{(0)}$, $\bm{Y}^{(1)}$, and $\bm{Y}^{(2)}$ were 7.1 dB in (a) and 4.1 dB in (b).
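
The captions of Figures 4 and 5 report SDR (signal-to-distortion ratio) in dB. As a reminder of what this kind of score measures, here is a minimal sketch of the simplest energy-ratio form of SDR. The exact SDR variant used in the paper is not stated in this excerpt, so the function `sdr_db` below is an assumption for illustration only.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simple SDR in dB: reference energy over the energy of the estimation error.

    This is the plain energy-ratio definition (an assumption); evaluation
    toolkits often use more elaborate, projection-based variants.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = estimate - reference
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

# Toy example: a short reference signal and an estimate with a constant offset.
s = np.array([1.0, 0.0, -1.0, 0.0])
print(f"SDR: {sdr_db(s, s + 0.1):.2f} dB")
```

A higher value means the estimate is closer to the reference, so the figures can be read as showing how far each method improves on the simple BF outputs.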

Theorems & Definitions (2)

Theorem 1
proof
