Ambisonizer: Neural Upmixing as Spherical Harmonics Generation

Yongyi Zang; Yifan Wang; Minglun Lee

TL;DR

Ambisonizer is a unified neural upmixing framework that directly generates first-order Ambisonic B-format from mono or stereo inputs, framing mono-to-any upmixing as unconditional generation of spherical harmonics and stereo-to-any upmixing as conditional generation. The model feeds an audio encoder and a variational spatial encoder into a transformer-based bottleneck, followed by a decoder that outputs $Y_W$, $Y_X$, and $Y_Y$. The omnidirectional channel is constrained to $Y_W = \tfrac{1}{2}(Y_L+Y_R)$, where $Y_L$ and $Y_R$ are the stereo channels, and the stereo field is reconstructed by decoding at an angle $\theta$ via $Y = Y_W + Y_X \cos(\theta) + Y_Y \sin(\theta)$. Training combines an ELBO objective with multi-resolution STFT and scaled $L_2$ losses, using synthetic Ambisonic data generated from Ambisonic IRs and MUSDB18-HQ sources. In subjective evaluations, Ambisonizer outputs decoded back to stereo are competitive with a strong commercial mono-to-stereo baseline. This supports first-order Ambisonics as an intermediate representation for channel-agnostic upmixing, although the authors note limitations of the Ambisonic format and decoding artifacts. The work positions Ambisonic upmixing as a promising path for flexible spatial audio rendering and provides a foundation for open-source advancement in this area.
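The decoding formula in the summary can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names and the choice of $\theta = \pm 90^\circ$ for the left and right channels are assumptions made for illustration. With that choice, the decode reduces to $Y_L = Y_W + Y_Y$ and $Y_R = Y_W - Y_Y$, which agrees with the constraint $Y_W = \tfrac{1}{2}(Y_L+Y_R)$.

```python
import numpy as np


def decode_foa_horizontal(w, x, y, theta):
    """Decode horizontal first-order Ambisonics at angle theta (radians).

    Implements Y = Y_W + Y_X * cos(theta) + Y_Y * sin(theta).
    """
    return w + x * np.cos(theta) + y * np.sin(theta)


def decode_to_stereo(w, x, y, half_angle=np.pi / 2):
    """Decode W, X, Y to a (left, right) pair.

    Assumption: left and right sit at +half_angle and -half_angle;
    the default of 90 degrees is an illustrative choice, not a value
    confirmed by the paper summary.
    """
    left = decode_foa_horizontal(w, x, y, +half_angle)
    right = decode_foa_horizontal(w, x, y, -half_angle)
    return left, right


# Toy signals standing in for the decoder outputs Y_W, Y_X, Y_Y.
w = np.array([1.0, 2.0])
x = np.array([0.5, -0.5])
y = np.array([0.25, 0.75])

left, right = decode_to_stereo(w, x, y)
```

For any other symmetric pair of angles, the average of the two decoded channels picks up an extra $Y_X \cos(\theta)$ term, so the mid-channel constraint holds exactly only at $\pm 90^\circ$ in this sketch.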

Abstract

Neural upmixing, the task of generating immersive music with an increased number of channels from fewer input channels, has been an active research area, with mono-to-stereo and stereo-to-surround upmixing treated as separate problems. In this paper, we propose a unified approach to neural upmixing by formulating it as spherical harmonics generation, or more specifically, Ambisonic generation. We explicitly formulate mono upmixing as unconditional generation and stereo upmixing as conditional generation, where the stereo signals serve as conditions. We provide evidence that the output of our method, when decoded to stereo, matches a strong commercial stereo widener in subjective ratings. Overall, our work presents direct upmixing to Ambisonic format as a strong and promising approach to neural upmixing. We also discuss its limitations.

Paper Structure (20 sections, 9 equations, 5 figures, 1 table)

Introduction
The Ambisonic format
First-Order Ambisonics
Higher Order Ambisonics
The Ambisonizer Model
Audio and Spatial Encoders
Bottleneck
Decoder
Loss Function
Experiments
Synthesizing first-order Ambisonic data
Source Datasets
Ambisonic IR Datasets
Sound Sources Datasets
Experimental Setup
...and 5 more sections

Figures (5)

Figure 1: Higher Order Ambisonic ($4^{th}$ order) [eigenbeam2023]
Figure 2: The Ambisonizer model architecture. Blue blocks denote encoders, and green block denotes the decoder. Best viewed in color.
Figure 3: Waves PS-22 Stereo Maker [waves2023ps22]
Figure 4: Subjective rating results. 'All' setting is calculated by aggregating all individual sets; error bars are calculated with a 95% confidence interval.
