Modeling and Driving Human Body Soundfields through Acoustic Primitives
Chao Huang, Dejan Markovic, Chenliang Xu, Alexander Richard
TL;DR
The paper tackles the lack of realistic near-field spatial audio for full-body avatars by introducing acoustic primitives (low-order, sphere-attached sound sources) that together form a complete 3D soundfield. Primitive coefficients are learned from synchronized body pose and headset audio, and a differentiable renderer produces audio at arbitrary locations, using spherical harmonics up to order $N=2$ and a tunable number of primitives $K$. The approach delivers drivable near-field 3D audio with quality comparable to a state-of-the-art baseline while achieving ~15x faster inference, which enables real-time applications in VR/AR and games. Although it relies on multi-microphone capture data for training, the method shows strong potential for immersive avatar audio and suggests paths toward broader generalization and learning from commodity hardware. Key components include: (i) a pose and audio encoder that produces fused features guiding primitive decoding, (ii) a differentiable rendering equation based on spherical wave functions, and (iii) a loss suite combining multiscale STFT terms with clip-level guidance that associates sounds with their corresponding primitives.
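To make two of the quantities above concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a multiscale STFT loss defined as the mean L1 distance between magnitude spectrograms at several FFT sizes; the FFT sizes, hop length, Hann window, and the function names `stft_mag`, `multiscale_stft_loss`, and `num_sh_coefficients` are all illustrative assumptions. The coefficient count uses the standard fact that a spherical harmonic expansion up to order $N$ has $(N+1)^2$ terms, so $N=2$ gives 9 coefficients per primitive.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a 1-D signal with a Hann window (no padding)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=-1))

def multiscale_stft_loss(pred, target, fft_sizes=(256, 512, 1024)):
    """Mean L1 distance between magnitude spectrograms, averaged over scales.

    The FFT sizes and the hop of n_fft // 4 are assumed values for illustration.
    """
    losses = []
    for n_fft in fft_sizes:
        hop = n_fft // 4
        diff = stft_mag(pred, n_fft, hop) - stft_mag(target, n_fft, hop)
        losses.append(np.mean(np.abs(diff)))
    return float(np.mean(losses))

def num_sh_coefficients(order):
    """Number of spherical harmonic terms up to and including `order`."""
    return (order + 1) ** 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.standard_normal(4800)               # stand-in for a recorded signal
    pred = target + 0.1 * rng.standard_normal(4800)  # stand-in for a rendered signal
    print("loss:", multiscale_stft_loss(pred, target))
    print("coefficients per primitive at N=2:", num_sh_coefficients(2))
```

Comparing spectrograms at several resolutions trades off time and frequency precision, which is why such losses are commonly built from more than one FFT size; the actual terms and weights used in the paper may differ.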
Abstract
While rendering and animation of photorealistic 3D human body models have matured and reached an impressive quality over the past years, modeling the spatial audio associated with such full-body models has been largely ignored so far. In this work, we present a framework that allows for high-quality spatial audio generation, capable of rendering the full 3D soundfield generated by a human body, including speech, footsteps, hand-body interactions, and others. Given a basic audio-visual representation of the body in the form of 3D body pose and audio from a head-mounted microphone, we demonstrate that we can render the full acoustic scene at any point in 3D space efficiently and accurately. To enable near-field and real-time rendering of sound, we borrow the idea of volumetric primitives from graphical neural rendering and transfer them into the acoustic domain. Our acoustic primitives result in an order of magnitude smaller soundfield representations and overcome deficiencies in near-field rendering compared to previous approaches.
