Ambisonizer: Neural Upmixing as Spherical Harmonics Generation
Yongyi Zang, Yifan Wang, Minglun Lee
TL;DR
Ambisonizer is a unified neural upmixing framework that directly generates first-order Ambisonic B-format from mono or stereo inputs. It frames mono-to-any upmixing as unconditional generation and stereo-to-any upmixing as conditional generation of spherical harmonics. The model feeds an audio encoder and a variational spatial encoder into a transformer-based bottleneck, followed by a decoder that outputs $Y_W$, $Y_X$, and $Y_Y$. It enforces $Y_W = \tfrac{1}{2}(Y_L+Y_R)$, where $Y_L$ and $Y_R$ are the left and right stereo channels, and reconstructs the stereo field by decoding at angle $\theta$ via $Y = Y_W + Y_X \cos(\theta) + Y_Y \sin(\theta)$. Training combines an ELBO objective with multi-resolution STFT and scaled $L_2$ losses, using synthetic Ambisonic data generated from Ambisonic IRs and MUSDB18-HQ sources. In subjective evaluations, Ambisonizer outputs decoded back to stereo are competitive with a strong commercial mono-to-stereo baseline. This supports first-order Ambisonics as a viable intermediate representation for channel-agnostic upmixing, although the Ambisonic format has noted limitations and decoding introduces artifacts. The work positions Ambisonic upmixing as a promising path for flexible spatial audio rendering and provides a foundation for open-source advancement in this area.
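The two relations quoted above can be illustrated with a minimal sketch. It assumes horizontal-only first-order decoding and a stereo pair obtained from mirrored angles of plus and minus 90 degrees; that angle, and the function names `mid_channel`, `decode_direction`, and `decode_stereo`, are illustrative assumptions rather than the authors' implementation. The functions take per-sample floats, and the same arithmetic applies elementwise to array types that overload `+` and `*`.

```python
import math


def mid_channel(left, right):
    """W channel as the stereo mid signal: Y_W = (Y_L + Y_R) / 2."""
    return 0.5 * (left + right)


def decode_direction(w, x, y, theta):
    """Signal at angle theta (radians): Y = Y_W + Y_X cos(theta) + Y_Y sin(theta)."""
    return w + x * math.cos(theta) + y * math.sin(theta)


def decode_stereo(w, x, y, theta=math.pi / 2):
    """Return (left, right) decoded at +theta and -theta.

    The default of +/-90 degrees is an assumption made for this sketch.
    """
    left = decode_direction(w, x, y, theta)
    right = decode_direction(w, x, y, -theta)
    return left, right
```

Under the assumed 90-degree angle the cosine term vanishes, so the decoded pair reduces to $Y_W + Y_Y$ and $Y_W - Y_Y$, and its average recovers $Y_W$, which is consistent with the constraint $Y_W = \tfrac{1}{2}(Y_L+Y_R)$.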
Abstract
Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics generation, or more specifically, Ambisonic generation. We explicitly formulate mono upmixing as unconditional generation and stereo upmixing as conditional generation, where the stereo signals serve as conditions. We provide evidence that the output of our proposed method, when decoded to stereo, matches a strong commercial stereo widener in subjective ratings. Overall, our work presents direct upmixing to the Ambisonic format as a strong and promising approach to neural upmixing. We also discuss its limitations.
