SpatialCodec: Neural Spatial Speech Coding

Zhongweiyang Xu, Yong Xu, Vinay Kothapally, Heming Wang, Muqiao Yang, Dong Yu

TL;DR

The paper addresses preserving spatial cues in multichannel speech at very low bitrates by proposing a two-branch neural framework, SpatialCodec, that separately encodes a reference channel and the spatial information needed to reconstruct remaining channels. The first branch uses a neural sub-band codec on the reference channel, while the second branch yields complex ratio filters to synthesize non-reference channels from the reference, trained with targeted losses to avoid mismatches. Novel spatial evaluation metrics, including spatial similarity and beamforming-based assessments, demonstrate that SpatialCodec at 12 kbps total outperforms high-bitrate baselines and channel-independent methods in spatial fidelity and DoA accuracy. This approach enables efficient, spatially faithful multichannel speech coding suitable for teleconferencing and meeting-room scenarios, with potential extensions to multi-speaker and moving-source environments.

Abstract

In this work, we address the challenge of encoding speech captured by a microphone array using deep learning techniques, with the aim of preserving and accurately reconstructing the crucial spatial cues embedded in multi-channel recordings. We propose a neural spatial audio coding framework that achieves a high compression ratio by combining a single-channel neural sub-band codec with SpatialCodec. Our approach encompasses two phases: (i) a neural sub-band codec encodes the reference channel at a low bit rate, and (ii) a SpatialCodec captures relative spatial information for accurate multi-channel reconstruction at the decoder end. In addition, we propose novel evaluation metrics to assess spatial cue preservation: (i) spatial similarity, which calculates cosine similarity in a spatially intuitive beamspace, and (ii) beamformed audio quality. Our system shows superior spatial performance compared with high-bitrate baselines and a black-box neural architecture. Demos are available at https://xzwy.github.io/SpatialCodecDemo. Code and models are available at https://github.com/XZWY/SpatialCodec.
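
The spatial similarity metric mentioned above can be illustrated with a minimal sketch. The abstract only states that it is a cosine similarity computed in a beamspace, so everything else here is an assumption made for illustration: the far-field steering-vector model, the use of beam power per look direction as the beamspace representation, and the hypothetical helper names `steering_vectors`, `beamspace_power`, and `spatial_similarity`. It is not the authors' implementation.

```python
import numpy as np

def steering_vectors(mic_xy, angles_rad, freq_hz, c=343.0):
    """Far-field steering vectors for a planar array (assumed model).

    mic_xy: (M, 2) microphone positions in metres.
    angles_rad: (K,) candidate look directions.
    Returns a complex array of shape (K, M).
    """
    directions = np.stack([np.cos(angles_rad), np.sin(angles_rad)], axis=1)
    delays = directions @ mic_xy.T / c  # (K, M), relative delays in seconds
    return np.exp(2j * np.pi * freq_hz * delays)

def beamspace_power(x, steer):
    """Project one multichannel STFT bin x (M,) onto each look direction.

    Returns the beam output power per direction, shape (K,).
    """
    return np.abs(steer.conj() @ x) ** 2

def spatial_similarity(x_ref, x_hat, steer):
    """Cosine similarity between the beamspace power patterns of the
    original bin x_ref and the reconstructed bin x_hat (illustrative only)."""
    p = beamspace_power(x_ref, steer)
    q = beamspace_power(x_hat, steer)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-12))
```

Under these assumptions the score is invariant to a common complex gain applied to all channels, so it responds to changes in inter-channel structure rather than overall level. A reconstruction whose energy appears to arrive from a different direction yields a score below 1.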

Paper Structure (15 sections, 11 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: An Overview of the proposed SpatialCodec framework.
  • Figure 2: Spatial Features Visualization (1kHz and 3kHz).