
Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information

Francesc Lluís, Nils Meyer-Kahlen

TL;DR

This work proposes to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information, and shows how both room- and position-specific parameters are considered in the final output.

Abstract

For audio in augmented reality (AR), knowledge of the users' real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at different positions than the sources available for estimation. Here, we propose to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take the latent space and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are considered in the final output.
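The contrastive objective described in the abstract can be illustrated with a small sketch. The abstract does not say which contrastive loss is used, so the supervised-contrastive-style formulation below is an assumption. The function name `room_contrastive_loss`, the `temperature` value, and the use of integer room labels to define positive pairs are all hypothetical. The idea it shows: embeddings of sounds recorded in the same room are pulled together and embeddings from different rooms are pushed apart, regardless of source or receiver position.

```python
import numpy as np


def room_contrastive_loss(embeddings, room_ids, temperature=0.1):
    """Illustrative supervised-contrastive-style loss (not the authors' exact loss).

    embeddings: (n, d) array of encoder outputs (hypothetical).
    room_ids:   length-n labels; samples sharing a label are positives.
    """
    z = np.asarray(embeddings, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    room_ids = np.asarray(room_ids)
    n = len(room_ids)

    self_mask = np.eye(n, dtype=bool)
    pos_mask = (room_ids[:, None] == room_ids[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        raise ValueError("batch needs at least two samples from one room")

    # Scaled similarities; a sample is never compared with itself.
    sim = np.where(self_mask, -np.inf, z @ z.T / temperature)

    # Numerically stable log-softmax over all other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    log_prob = sim - sim_max - log_norm

    # Average log-probability assigned to same-room samples.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())


# Toy batch: two tight clusters of embeddings.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_matching = room_contrastive_loss(toy, [0, 0, 1, 1])
loss_mismatched = room_contrastive_loss(toy, [0, 1, 0, 1])
```

In this toy batch, the loss is lower when the room labels agree with the clusters than when they cut across them. This is the property that would encourage an encoder to keep room information and discard position information.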

Paper Structure

This paper contains 11 sections, 4 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Training and inference stages. Initially, the encoder is trained using contrastive learning to capture all unique features of the room. During the generator's training, the encoder's weights are kept fixed. The generator, conditioned on a receiver-source position vector and a room-specific embedding, is then trained to produce spatial room impulse responses that incorporate both room-specific and position-dependent features. The inference path is indicated by dashed arrows.
  • Figure 2: a) Diagram of the generator deep neural network architecture. b) Diagram of the Residual Block of the generator.
  • Figure 3: Reverberation time in octave bands determined from the generated and the true responses for the PROPOSED model. Underestimation occurs mainly at high RTs.
  • Figure 4: Example of true DRR and DRR of the generated SRIR along a line leading past the source for the PROPOSED model.
  • Figure 5: Example of direct sound DoAs represented by arrows. Receiver positions are shown as dots and the source position as a red square. DoAs obtained from the generated responses are shown above, ground-truth DoAs below. Results are shown for the PROPOSED model.
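
The direct-to-reverberant ratio (DRR) compared in Figure 4 is commonly defined as the energy of the direct sound divided by the energy of everything that follows, expressed in decibels. The sketch below shows one common way to estimate it from a single-channel impulse response. The page does not describe the authors' procedure, so the function name, the 2.5 ms window around the main peak, and the use of one channel (for example, the omnidirectional component of a spatial response) are all assumptions.

```python
import numpy as np


def direct_to_reverberant_ratio_db(rir, fs, window_ms=2.5):
    """Estimate the DRR in dB from a single-channel impulse response.

    The direct sound is taken as a short window centred on the largest
    absolute sample; the window length is an assumed, illustrative choice.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(window_ms * 1e-3 * fs))

    start = max(peak - half, 0)
    end = min(peak + half + 1, len(rir))

    direct_energy = np.sum(rir[start:end] ** 2)
    reverb_energy = np.sum(rir[end:] ** 2)  # everything after the window
    return 10.0 * np.log10(direct_energy / reverb_energy)


# Synthetic check: unit direct impulse plus one reflection of amplitude 0.1.
fs = 1000
rir = np.zeros(1000)
rir[100] = 1.0
rir[200] = 0.1
drr_db = direct_to_reverberant_ratio_db(rir, fs, window_ms=2.0)
```

With a direct-sound energy of 1 and a reflection energy of 0.01, the synthetic response has a DRR of about 20 dB. In a real room the value falls as the receiver moves away from the source, which is the position-dependent behaviour Figure 4 examines.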