DiffAU: Diffusion-Based Ambisonics Upscaling

Amit Milstein, Nir Shlezinger, Boaz Rafaely

TL;DR

This paper tackles the problem of enriching spatial audio realism by upscaling from first-order Ambisonics (FOA) to high-order Ambisonics (HOA) using diffusion-based posterior sampling. DiffAU implements a cascaded set of conditional diffusion blocks to progressively convert order-$N$ Ambisonic signals into order-$N+1$ and ultimately HOA, learning $p(\mathbf{a}_{N'}|\mathbf{a}_N)$ without relying on sparsity priors. Empirical results in anechoic, multi-speaker conditions show substantial objective gains (STFT-SDR) over sparsity-based baselines, and a formal listening test indicates perceptual indistinguishability from true HOA. The approach offers a principled, modular framework for Ambisonics upscaling with potential flexibility to higher orders and more challenging acoustic settings.
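The cascade described above can be illustrated with a small shape-level sketch. It relies on one standard fact: an order-$N$ Ambisonics signal has $(N+1)^2$ channels, so each step from order $N$ to $N+1$ adds $2N+3$ channels. Everything else is an assumption made for illustration. The names `num_channels`, `placeholder_block`, and `cascade_upscale` are hypothetical, and the placeholder block simply appends random channels where a trained conditional diffusion block would draw a sample from $p(\mathbf{a}_{N+1}|\mathbf{a}_N)$. Passing the lower-order channels through unchanged is also an assumption, not something the summary states.

```python
import random


def num_channels(order):
    """Channel count of an order-N Ambisonics signal: (N + 1)^2."""
    if order < 0:
        raise ValueError("Ambisonics order must be non-negative")
    return (order + 1) ** 2


def placeholder_block(a_n, order, rng):
    """Hypothetical stand-in for one conditional diffusion block.

    A trained block would sample the order-(N+1) signal given a_n.
    Here the new channels are plain Gaussian noise, so only the
    shapes and the cascade control flow are meaningful.
    """
    n_samples = len(a_n[0])
    n_new = num_channels(order + 1) - num_channels(order)  # equals 2 * order + 3
    new_channels = [
        [rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_new)
    ]
    # Assumption: the existing lower-order channels are kept as-is.
    return a_n + new_channels


def cascade_upscale(foa, target_order, seed=0):
    """Raise a first-order signal to target_order one order at a time."""
    if len(foa) != num_channels(1):
        raise ValueError("FOA input must have 4 channels")
    rng = random.Random(seed)
    signal, order = foa, 1
    while order < target_order:
        signal = placeholder_block(signal, order, rng)
        order += 1
    return signal


if __name__ == "__main__":
    foa = [[0.0] * 8 for _ in range(4)]  # 4 channels, 8 samples each
    hoa = cascade_upscale(foa, target_order=3)
    print(len(hoa), "channels at order 3")
```

With a 4-channel FOA input and `target_order=3`, the loop runs two stages (4 to 9 channels, then 9 to 16), which mirrors the FOA to third-order path mentioned in the abstract.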

Abstract

Spatial audio enhances immersion by reproducing 3D sound fields, with Ambisonics offering a scalable format for this purpose. While first-order Ambisonics (FOA) notably facilitates hardware-efficient acquisition and storage of sound fields as compared to high-order Ambisonics (HOA), its low spatial resolution limits realism, highlighting the need for Ambisonics upscaling (AU) as an approach for increasing the order of Ambisonics signals. In this work, we propose DiffAU, a cascaded AU method that leverages recent developments in diffusion models, combined with a novel adaptation to spatial audio, to generate third-order Ambisonics from FOA. By learning data distributions, DiffAU provides a principled approach that rapidly and reliably reproduces HOA in various settings. Experiments in anechoic conditions with multiple speakers show strong objective and perceptual performance.

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Schematic illustration of the overall architecture of DiffAU
  • Figure 2: Directional energy plots (azimuth-elevation). Columns: FOA, 2nd- and 3rd-order Ambisonics ground truth, 2nd- and 3rd-order DiffAU outputs. Rows correspond to the number of active sources.
  • Figure 3: Listening test results