Room Transfer Function Reconstruction Using Complex-valued Neural Networks and Irregularly Distributed Microphones

Francesca Ronchini, Luca Comanducci, Mirco Pezzoli, Fabio Antonacci, Augusto Sarti

TL;DR

This work reconstructs room transfer functions (RTFs), and with them the full complex sound field, from sparse, irregularly placed microphone measurements. It introduces a complex-valued neural network (CVNN) with a U-Net-like architecture that takes the incomplete RTF data $\tilde{\mathbf{G}}$ and a measurement mask as input and learns a mapping $\mathcal{U}(\tilde{\mathbf{G}})$ to the full complex RTFs across $K$ frequency bins in the modal range $[30, 300]$ Hz. The method is evaluated on 5,000 synthetic rooms and on real measurements from ISOBEL Room B. It achieves lower complex NMSE and better phase fidelity than kernel-based interpolation, and it compares favorably with a magnitude-only approach, especially at low frequencies. The CVNN-based approach thus provides phase-aware sound-field reconstruction with relatively few sensors, which is relevant to immersive audio and room acoustics analysis.
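The input described above (incomplete RTF data plus a measurement mask) can be illustrated with a minimal NumPy sketch. The grid size, the number of frequency bins, the random field, and the helper name `make_masked_input` are all assumptions made for illustration; this summary does not specify the paper's actual preprocessing.

```python
import numpy as np

def make_masked_input(G_full, num_mics, rng):
    """Keep RTF values only at randomly chosen grid points.

    G_full   : complex array of shape (H, W, K), full RTFs on a grid
               (hypothetical layout: H x W positions, K frequency bins)
    num_mics : number of irregularly placed microphones to keep
    Returns (G_tilde, mask), where mask is a boolean (H, W) array.
    """
    H, W, _ = G_full.shape
    # Draw microphone positions without replacement on the flattened grid.
    idx = rng.choice(H * W, size=num_mics, replace=False)
    mask = np.zeros(H * W, dtype=bool)
    mask[idx] = True
    mask = mask.reshape(H, W)
    # Zero out every position that has no microphone, for all frequency bins.
    G_tilde = np.where(mask[..., None], G_full, 0)
    return G_tilde, mask

rng = np.random.default_rng(0)
# Placeholder complex field: 32 x 32 grid and 40 bins are assumed values.
G_full = rng.standard_normal((32, 32, 40)) + 1j * rng.standard_normal((32, 32, 40))
G_tilde, mask = make_masked_input(G_full, num_mics=15, rng=rng)
```

A network in the role of $\mathcal{U}$ would then receive `G_tilde` together with `mask` and be trained to return an estimate of `G_full`.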

Abstract

Reconstructing the room transfer functions needed to calculate the complex sound field in a room has several important real-world applications. However, an impractically large number of microphones is often required. Recently, in addition to classical signal processing methods, deep learning techniques have been applied to reconstruct the room transfer function starting from a very limited set of measurements at scattered points in the room. In this paper, we employ complex-valued neural networks to estimate room transfer functions in the frequency range of the first room resonances, using a few irregularly distributed microphones. To the best of our knowledge, this is the first time that complex-valued neural networks have been used to estimate room transfer functions. To analyze the benefits of applying complex-valued optimization to the considered task, we compare the proposed technique with a state-of-the-art kernel-based signal processing approach for sound field reconstruction, showing that the proposed technique exhibits relevant advantages in terms of phase accuracy and overall quality of the reconstructed sound field. For informative purposes, we also compare the model with a similarly structured data-driven approach that, however, applies a real-valued neural network to reconstruct only the magnitude of the sound field.
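As background on what "complex-valued" means at the layer level, the sketch below shows the standard identity by which one complex convolution decomposes into four real convolutions. It is a generic 1-D NumPy illustration, not the paper's layer implementation; the function name `complex_conv_from_real` is hypothetical.

```python
import numpy as np

def complex_conv_from_real(x, w):
    """Complex 1-D convolution built from four real convolutions.

    Uses (xr + j*xi) * (wr + j*wi) = (xr*wr - xi*wi) + j*(xr*wi + xi*wr),
    where * denotes convolution.
    """
    xr, xi = x.real, x.imag
    wr, wi = w.real, w.imag
    real = np.convolve(xr, wr) - np.convolve(xi, wi)
    imag = np.convolve(xr, wi) + np.convolve(xi, wr)
    return real + 1j * imag

rng = np.random.default_rng(1)
x = rng.standard_normal(16) + 1j * rng.standard_normal(16)  # complex input signal
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)    # complex kernel
y = complex_conv_from_real(x, w)
```

Because the real and imaginary parts share the same kernel pair, the operation treats magnitude and phase jointly, unlike a real-valued network that is given only the magnitude.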

Paper Structure (12 sections, 9 equations, 5 figures)

Figures (5)

  • Figure 1: Schematic representation of the proposed CVNN. The first block represents the input of the network, followed by the four complex-valued convolutional encoder layers and the five complex-valued convolutional decoder layers.
  • Figure 2: Magnitude of the sound field obtained using: the CVNN method (b), Lluis et al. [lluis2020sound] (c), Ueno et al. [ueno2018kernel] (d). We use the $m=15$ microphone configuration shown in (e). The ground truth magnitude is shown in (a). The size of the room is $[4.8~\mathrm{m} \times 5.4~\mathrm{m} \times 2.4~\mathrm{m}]$. A source at $100~\mathrm{Hz}$ is placed at $[2.1~\mathrm{m}, 2~\mathrm{m}, 1.2~\mathrm{m}]^T$.
  • Figure 3: Phase of the sound field obtained using the same configuration as in Fig. 2 using: CVNN (a), Ueno et al. [ueno2018kernel] (b). The ground truth is shown in (c).
  • Figure 4: (a) $\mathrm{NMSE}_\text{complex}$ calculated over simulated data with varying $T_{60}$ levels and fixed number of microphones $m=55$. (b) $\mathrm{NMSE}_\text{complex}$ calculated over simulated data with varying number of microphones $m$ and fixed $T_{60}=1~\mathrm{s}$. (c) $\mathrm{NMSE}_\text{complex}$ calculated over real data with varying number of microphones $m$ and fixed $T_{60}=1~\mathrm{s}$. The solid line corresponds to the proposed CVNN method, while the dashed line corresponds to the kernel-based technique [ueno2018kernel].
  • Figure 5: (a) $\mathrm{NMSE}_\text{abs}$ calculated over simulated data with varying $T_{60}$ levels and fixed number of microphones $m=55$. (b) $\mathrm{NMSE}_\text{abs}$ calculated over simulated data with varying number of microphones $m$ and fixed $T_{60}=1~\mathrm{s}$. (c) $\mathrm{NMSE}_\text{abs}$ calculated over real data with varying number of microphones $m$ and fixed $T_{60}=1~\mathrm{s}$. The solid line corresponds to the proposed CVNN method, while the dashed line corresponds to the data-driven approach [lluis2020sound].
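The two metrics named in Figures 4 and 5 can be sketched as follows. The exact definitions of $\mathrm{NMSE}_\text{complex}$ and $\mathrm{NMSE}_\text{abs}$ are not given in this summary, so the code assumes the common form (error energy over reference energy, in dB), applied to the complex field and to its magnitude respectively. The function names and the test field are hypothetical.

```python
import numpy as np

def nmse_db(estimate, reference):
    """Normalized mean squared error in dB (assumed standard definition)."""
    err = np.sum(np.abs(estimate - reference) ** 2)
    ref = np.sum(np.abs(reference) ** 2)
    return 10.0 * np.log10(err / ref)

def nmse_complex_db(G_hat, G):
    # Compares the complex values, so phase errors are penalized.
    return nmse_db(G_hat, G)

def nmse_abs_db(G_hat, G):
    # Compares magnitudes only, so phase errors are invisible.
    return nmse_db(np.abs(G_hat), np.abs(G))

rng = np.random.default_rng(2)
G = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
# Estimate with a 10% magnitude error and a 0.5 rad phase offset everywhere.
G_hat = 1.1 * G * np.exp(1j * 0.5)

abs_db = nmse_abs_db(G_hat, G)       # about -20 dB: only the 10% gain error counts
cplx_db = nmse_complex_db(G_hat, G)  # noticeably worse, since the phase offset counts too
```

In this toy case the magnitude-only metric reports a small error while the complex metric exposes the phase offset, which is one reason a phase-aware comparison such as Figure 4 can tell a different story from Figure 5.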