Real-Time Auralization for First-Person Vocal Interaction in Immersive Virtual Environments

Mauricio Flores-Vargas, Enda Bates, Rachel McDonnell

TL;DR

This paper addresses real-time auralization for first-person vocal interaction in immersive VR by introducing a two-part pipeline: (i) SIR production from a first-person perspective to support 5DoF audio-visual perception, achieved with a grid of 20 positions and 4 directions per position, captured as 16-channel SIRs via a third-order Ambisonics (3OA) setup; and (ii) an audio processing pipeline that renders the user’s voice in real time, using Unity for tracking and Reaper for convolution-based processing, with directional interpolation (constant-power panning) and translational interpolation (Inverse Distance Weighting). Latency is mitigated through direct-signal compensation and alignment based on the initial time delay gap (ITDG), expressed as $\Delta t = t_1 - t_0 = \frac{(L_1 + L_2) - L}{c}$, where $L$ is the direct path length, $L_1 + L_2$ is the length of the first-reflection path, and $c$ is the speed of sound; as a practical choice, floor reflections are omitted due to shadowing. The system encodes the microphone input to 3OA, routes it through a 4x4 directional grid, and decodes to binaural output, enabling first-person perception of spatialized acoustics in VR. Overall, the work demonstrates the feasibility of real-time, self-referential vocal auralization in immersive environments, with potential for multimodal research and creative VR applications. The interpolation relies on the translational weights $w_i = \frac{1/d_i}{\sum_j 1/d_j}$ and the directional gains $L(\theta)=\cos(\theta)$, $R(\theta)=\sin(\theta)$.
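The three expressions in the summary above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Unity/Reaper implementation: the function names, the `eps` guard for a zero distance, the default speed of sound of 343 m/s, and the convention that $\theta$ runs from 0 to $\pi/2$ between two adjacent measured directions are all hypothetical choices made for the example.

```python
import math

# Assumed value for air at roughly 20 °C; the source does not state one.
SPEED_OF_SOUND = 343.0  # metres per second


def idw_weights(distances, eps=1e-9):
    """Inverse Distance Weighting: w_i = (1/d_i) / sum_j (1/d_j).

    `distances` holds the listener's distance to each measured grid position.
    """
    # Guard (an assumption of this sketch): if the listener sits exactly on a
    # measured position, use that position alone to avoid dividing by zero.
    for i, d in enumerate(distances):
        if d < eps:
            return [1.0 if j == i else 0.0 for j in range(len(distances))]
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [value / total for value in inverse]


def constant_power_gains(theta):
    """Constant-power panning gains (cos(theta), sin(theta)).

    theta is assumed to lie in [0, pi/2]: 0 selects the first direction,
    pi/2 the second, and the squared gains always sum to 1.
    """
    return math.cos(theta), math.sin(theta)


def itdg_seconds(direct_path, leg_1, leg_2, c=SPEED_OF_SOUND):
    """Initial time delay gap: delta_t = ((L1 + L2) - L) / c, in seconds."""
    return ((leg_1 + leg_2) - direct_path) / c
```

For instance, `idw_weights([1.0, 3.0])` gives weights of approximately 0.75 and 0.25, so the nearer measurement position dominates the blend, and `constant_power_gains(math.pi / 4)` returns two equal gains of about 0.707, which keeps the summed power constant midway between two directions.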

Abstract

Multimodal research and applications are becoming more commonplace as Virtual Reality (VR) technology integrates different sensory feedback, enabling the recreation of real spaces in an audio-visual context. Within VR experiences, numerous applications rely on the user's voice as a key element of interaction, including music performances and public speaking applications. Self-perception of our voice plays a crucial role in vocal production. When singing or speaking, our voice interacts with the acoustic properties of the environment, shaping the adjustment of vocal parameters in response to the perceived characteristics of the space. This technical report presents a real-time auralization pipeline that leverages three-dimensional Spatial Impulse Responses (SIRs) for multimodal research applications in VR requiring first-person vocal interaction. It describes the impulse response creation and rendering workflow and the audio-visual integration, and it addresses latency and computational considerations. The system enables users to explore acoustic spaces from various positions and orientations within a predefined area, supporting three and five Degrees of Freedom (3DoF and 5DoF) in audio-visual multimodal perception for both research and creative applications in VR.

Paper Structure

This paper contains 14 sections, 2 equations, 4 figures.

Figures (4)

  • Figure 1: Grid of recording positions.
  • Figure 2: Speaker directivity pattern.
  • Figure 3: Directional (left) and Translation (right) interpolation.
  • Figure 4: Signal flow showing two of the twenty folder tracks in the Reaper project.