Modeling and Driving Human Body Soundfields through Acoustic Primitives

Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard

TL;DR

The paper addresses the lack of realistic near-field spatial audio for full-body avatars by introducing acoustic primitives: low-order sound sources, represented as small spheres attached to the body, that together form a complete 3D soundfield. The model learns primitive coefficients from synchronized body pose and headset audio and renders audio at arbitrary locations with a differentiable renderer, using spherical harmonics up to order $N=2$ and a tunable number of primitives $K$. The approach delivers drivable near-field 3D audio at a quality comparable to a state-of-the-art baseline while running inference roughly 15x faster, which enables real-time applications in VR/AR and games. Although it relies on multi-microphone capture data, the method shows strong potential for immersive avatar audio and suggests paths toward broader generalization and learning from commodity hardware. Key components include: (i) pose and audio encoders whose fused features guide primitive decoding, (ii) a differentiable rendering equation based on spherical wave functions, and (iii) a loss suite that combines multiscale STFT terms with clip-level guidance to associate sounds with their corresponding primitives.
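
To make the multiscale STFT term in (iii) more concrete, the following is a minimal sketch of a generic multi-resolution spectral loss. It is not the authors' implementation: this summary does not state the window sizes, hop lengths, or exact distance used in the paper, so the names `stft_magnitude` and `multiscale_stft_loss`, the window sizes, and the L1 magnitude distance are all illustrative assumptions.

```python
import numpy as np


def stft_magnitude(signal, window_size, hop):
    """Magnitude spectrogram from Hann-windowed frames and a real FFT."""
    window = np.hanning(window_size)
    n_frames = 1 + (len(signal) - window_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + window_size] * window
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=-1))


def multiscale_stft_loss(predicted, target, window_sizes=(256, 512, 1024)):
    """Mean absolute magnitude difference, averaged over several resolutions.

    The window sizes and the hop of window_size // 4 are assumed values,
    not taken from the paper.
    """
    losses = []
    for w in window_sizes:
        p = stft_magnitude(predicted, w, w // 4)
        t = stft_magnitude(target, w, w // 4)
        losses.append(np.mean(np.abs(p - t)))
    return float(np.mean(losses))
```

Comparing spectrograms at several window sizes trades off time and frequency resolution, so short transients (such as footsteps) and tonal content (such as speech) both contribute to the error.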

Abstract

While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in the form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and real-time rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.

Paper Structure

This paper contains 19 sections, 11 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Single high-order soundfield (a) vs. acoustic primitives (b). The existing approach [xudong2023sounding] predicts a high-order ambisonic soundfield around the human body, preventing sound from being rendered in the near-field; our proposed acoustic primitives, represented as small spheres attached to the body, successfully model a complete and accurate 3D body soundfield.
  • Figure 2: Our pose-guided acoustic primitive learning framework takes headset microphone signals and body pose information as inputs. It outputs the acoustic primitive representations, weights, and offsets in one pass. The framework consists of two main stages. In the first stage, we employ separate encoders to process the audio and pose signals into feature spaces. An Audio-Pose Feature Fusion Module is then utilized to combine these features. In the second stage, the fused features are fed into an audio decoder network to generate the acoustic primitive coefficients. Additionally, two separate MLP heads are used to predict the weights and offsets for each acoustic primitive.
  • Figure 3: Illustration of the rendering process with the estimated acoustic primitives. (a) demonstrates how to render a waveform signal given the learned harmonic coefficients $\mathcal{S}_k$, primitive coordinate offset $\Delta_k$, and the weight $W_k$. Next, we show that for all the primitives, we render the audio generated by each primitive at the target location and aggregate the results to yield the final rendered audio at target position $(x,y,z)$.
  • Figure 4: Sound field visualizations for 4 different kinds of sound. The main sound field is in the center, with individual primitive contributions shown around it. We can observe that the method assigns acoustic energy to the correct acoustic primitives, e.g. speech comes mostly from the head with only a very small contribution from the shoulder primitives. We can also observe the speech directivity pattern matching the head orientation. For each visualization, the left/right 4 primitives are labeled as follows: foot, hip, hand, and shoulder (from bottom to top), and the middle one is the head.
  • Figure 5: Predicted and ground truth microphone signals at 5 different locations around the dome. We can observe good temporal alignment and good amplitude match except for the low-energy body tapping sound. We recommend zooming in for better visibility.
  • ...and 5 more figures
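
The render-and-aggregate step described in the Figure 3 caption can be sketched as follows. This is a deliberately simplified illustration, not the paper's renderer: it replaces the spherical-wave rendering of the coefficients $\mathcal{S}_k$ with a zeroth-order (monopole) model, in which each primitive emits a single signal that is delayed by $r/c$ and attenuated by $1/r$. The function names, the fixed speed of sound, and the convention that each offset $\Delta_k$ is added to a body joint position are assumptions made for this example.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed constant


def render_monopole(signal, source_pos, listener_pos, sample_rate):
    """Delay by r/c and attenuate by 1/r, applied in the frequency domain."""
    diff = np.asarray(listener_pos, dtype=float) - np.asarray(source_pos, dtype=float)
    r = max(float(np.linalg.norm(diff)), 1e-3)  # clamp to avoid division by zero
    n = len(signal)
    # Zero-pad so the delayed signal does not wrap around circularly.
    pad = n + int(np.ceil(r / SPEED_OF_SOUND * sample_rate)) + 1
    spectrum = np.fft.rfft(signal, n=pad)
    omega = 2.0 * np.pi * np.fft.rfftfreq(pad, d=1.0 / sample_rate)
    spectrum = spectrum * np.exp(-1j * omega * r / SPEED_OF_SOUND) / r
    return np.fft.irfft(spectrum, n=pad)[:n]


def render_primitives(signals, joint_positions, offsets, weights,
                      listener_pos, sample_rate):
    """Sum the weighted contributions of all primitives at the listener.

    signals: (K, T) array, one waveform per primitive (monopole simplification)
    joint_positions, offsets: (K, 3) arrays; primitive center = joint + offset
    weights: (K,) array of per-primitive weights
    """
    output = np.zeros(signals.shape[1])
    for s_k, joint, delta_k, w_k in zip(signals, joint_positions, offsets, weights):
        output += w_k * render_monopole(s_k, joint + delta_k, listener_pos, sample_rate)
    return output
```

A full renderer would evaluate higher-order terms per primitive, which is what lets the method capture directivity such as the head-oriented speech pattern noted in Figure 4; the monopole version above only captures distance-dependent delay and attenuation.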