Ambisonizer: Neural Upmixing as Spherical Harmonics Generation

Yongyi Zang, Yifan Wang, Minglun Lee

TL;DR

Ambisonizer introduces a unified neural upmixing framework that directly generates first-order Ambisonic B-format from mono or stereo inputs, framing mono-to-any as unconditional generation and stereo-to-any as conditional generation of spherical harmonics. The model feeds an audio encoder and a variational spatial encoder into a transformer-based bottleneck, followed by a decoder that outputs $Y_W$, $Y_X$, and $Y_Y$, while enforcing $Y_W = \tfrac{1}{2}(Y_L+Y_R)$ and reconstructing the stereo field via $Y = Y_W + Y_X \cos(\theta) + Y_Y \sin(\theta)$. Training combines an ELBO objective with multi-resolution STFT and scaled $L_2$ losses, using synthetic Ambisonic data generated from Ambisonic impulse responses (IRs) and MUSDB18-HQ sources. Subjective evaluations show that Ambisonizer outputs, when decoded back to stereo, are competitive with a strong commercial mono-to-stereo baseline. This supports first-order Ambisonics as a viable intermediate representation for channel-agnostic upmixing, although the authors note limitations of the Ambisonic format and decoding artifacts. The work positions Ambisonic upmixing as a promising path for flexible spatial audio rendering and provides a foundation for open-source advancement in this area.
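The decoding relation $Y = Y_W + Y_X \cos(\theta) + Y_Y \sin(\theta)$ can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names `decode_virtual_mic` and `decode_stereo`, the random test signals, and the choice of virtual microphones at $\theta = \pm 90°$ are all assumptions made for illustration. With that particular choice, averaging the decoded left and right channels recovers $Y_W$, which agrees with the constraint $Y_W = \tfrac{1}{2}(Y_L+Y_R)$.

```python
import numpy as np


def decode_virtual_mic(w, x, y, theta):
    """Evaluate Y(theta) = W + X*cos(theta) + Y*sin(theta) for horizontal B-format.

    w, x, y: arrays of equal length holding the W, X and Y channels.
    theta:   steering angle in radians (hypothetical parameter name).
    """
    return w + x * np.cos(theta) + y * np.sin(theta)


def decode_stereo(w, x, y, theta=np.pi / 2):
    """Decode to a stereo pair with virtual microphones at +theta and -theta.

    The default of +/-90 degrees is an assumption for this sketch, not a
    setting confirmed by the paper.
    """
    left = decode_virtual_mic(w, x, y, theta)
    right = decode_virtual_mic(w, x, y, -theta)
    return left, right


# Random stand-ins for one second of W, X, Y audio at 48 kHz (illustrative only).
rng = np.random.default_rng(0)
w, x, y = rng.standard_normal((3, 48000))

left, right = decode_stereo(w, x, y)

# At +/-90 degrees the X term vanishes, so the mid signal is W
# and the side signal is Y.
mid = 0.5 * (left + right)
side = 0.5 * (left - right)
```

Other angles trade width against the front-facing $Y_X$ component; for example, `decode_stereo(w, x, y, theta=np.pi / 4)` mixes part of $Y_X$ into both channels, in which case the left/right average no longer equals $Y_W$ exactly.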

Abstract

Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics - more specifically, Ambisonic generation. We explicitly formulate mono upmixing as unconditional generation and stereo upmixing as conditional generation, where the stereo signals serve as conditions. We provide evidence that our proposed methodology, when decoded to stereo, matches a strong commercial stereo widener in subjective ratings. Overall, our work presents direct upmixing to Ambisonic format as a strong and promising approach to neural upmixing. A discussion on limitations is also provided.

Paper Structure (20 sections, 9 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Higher Order Ambisonic ($4^{th}$ order) [eigenbeam2023]
  • Figure 2: The Ambisonizer model architecture. Blue blocks denote encoders, and the green block denotes the decoder. Best viewed in color.
  • Figure 3: Waves PS-22 Stereo Maker [waves2023ps22]
  • Figure 4: Subjective rating results. The 'All' setting aggregates all individual sets; error bars show 95% confidence intervals.
  • Figure :