BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

TL;DR

BANC is a neural binaural audio codec designed for efficient speech compression in single- and two-speaker scenarios while preserving the spatial location information of each speaker.

Abstract

We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The ability of our proposed model to compress and decode overlapping speech. 2) A novel architecture that compresses speech content and spatial cues separately, ensuring the preservation of each speaker's spatial context after decoding. 3) BANC's proficiency in reducing the bandwidth required for compressing binaural speech by 48% compared to compressing individual binaural channels. In our evaluation, we employed speech enhancement, room acoustics, and perceptual metrics to assess the accuracy of BANC's clean speech and spatial cue estimates.

Paper Structure (8 sections, 11 equations, 1 figure, 3 tables)

This paper contains 8 sections, 11 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Our proposed NAC configured for single-speaker and two-speaker two-spatial overlapped binaural speech. In § \ref{multi-audio}, we describe the details of our architecture and training paradigm. We train end-to-end networks with metric loss (Eq. \ref{metric_loss}) for 200k iterations. Then, we freeze the blocks shown in blue and train the rest of the network with the metric and adversarial loss (Eq. \ref{adversarial_loss}) for an additional 500k iterations (single-speaker) and 160k iterations (two-speaker).
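
The two-stage schedule described in the Figure 1 caption can be sketched as a small lookup that reports, for a given iteration, which losses are active and whether the blue blocks are frozen. This is only an illustrative sketch: the iteration counts come from the caption, but the names `Phase`, `phase_at`, `total_iterations`, and the configuration keys are hypothetical and not taken from the paper or its code.

```python
from dataclasses import dataclass

# Iteration counts as stated in the Figure 1 caption.
METRIC_ONLY_ITERS = 200_000
ADVERSARIAL_ITERS = {"single-speaker": 500_000, "two-speaker": 160_000}


@dataclass(frozen=True)
class Phase:
    """Hypothetical description of one training stage."""
    name: str
    freeze_blue_blocks: bool    # are the blocks drawn in blue frozen?
    use_metric_loss: bool
    use_adversarial_loss: bool


def total_iterations(config: str) -> int:
    """Total number of iterations across both stages for a configuration."""
    return METRIC_ONLY_ITERS + ADVERSARIAL_ITERS[config]


def phase_at(iteration: int, config: str) -> Phase:
    """Return the stage active at a 0-based iteration index."""
    total = total_iterations(config)
    if not 0 <= iteration < total:
        raise ValueError(f"iteration {iteration} is outside the {total}-step schedule")
    if iteration < METRIC_ONLY_ITERS:
        # Stage 1: end-to-end training with the metric loss only.
        return Phase("metric-only", False, True, False)
    # Stage 2: blue blocks frozen, rest trained with metric + adversarial loss.
    return Phase("metric+adversarial", True, True, True)
```

A training loop could call `phase_at(step, "two-speaker")` at each step to decide whether to freeze parameters and which loss terms to sum; how the freezing is actually implemented in the authors' code is not stated in this excerpt.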