Task and Perception-aware Distributed Source Coding for Correlated Speech under Bandwidth-constrained Channels

Sagnik Bhattacharya, Muhammad Ahmed Mohsin, Ahsan Bilal, John M. Cioffi

TL;DR

This work tackles real-time transmission of correlated speech from multiple edge devices over bandwidth-constrained wireless channels. It introduces a CSI-aware, NDPCA-aided distributed autoencoder with a perception-aware downstream task loss, augmented by a score-based diffusion model for downstream speech enhancement and an MS-STFT-based perceptual loss. Compared with naive autoencoder baselines, the approach improves PSNR by 19% in task-agnostic and 52% in task-aware settings, and it approaches the single-encoder upper bound, particularly at low bandwidth. It also provides a rate-distortion-perception curve for adapting realism to application needs. Overall, the method improves compression efficiency, perceptual realism, and task performance in wireless AR/VR scenarios, adapts to changing bandwidth without retraining, and exploits correlations among the speech sources.
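As a rough illustration of the quantities named above, the following sketch computes PSNR and a weighted sum of a distortion term and a multi-scale STFT distance. It is a minimal NumPy sketch under stated assumptions, not the authors' loss: the function names, the FFT sizes, the Hann window, the L1 spectrogram distance, and the use of plain mean squared error as a stand-in for the downstream task loss are all illustrative choices that do not come from the paper.

```python
import numpy as np


def psnr_db(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming samples lie in [-peak, peak]."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)


def stft_magnitude(signal, n_fft, hop):
    """Magnitude STFT with a Hann window; trailing samples that do not fill a frame are dropped."""
    signal = np.asarray(signal, dtype=np.float64)
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def ms_stft_distance(reference, estimate, fft_sizes=(256, 512, 1024)):
    """Mean L1 distance between magnitude spectrograms, averaged over several FFT sizes.

    The FFT sizes and the hop of n_fft // 4 are assumptions made for this sketch.
    """
    total = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        ref_mag = stft_magnitude(reference, n_fft, hop)
        est_mag = stft_magnitude(estimate, n_fft, hop)
        total += np.mean(np.abs(ref_mag - est_mag))
    return total / len(fft_sizes)


def perception_aware_loss(reference, estimate, weight):
    """Distortion term plus `weight` times the perceptual term.

    MSE stands in for the downstream task loss here (an assumption).
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    task_term = np.mean((reference - estimate) ** 2)
    perceptual_term = ms_stft_distance(reference, estimate)
    return task_term + weight * perceptual_term
```

Sweeping `weight` from zero upward and recording the resulting distortion at each bandwidth is one way a rate-distortion-perception curve of the kind mentioned above could be traced. Inputs must be at least as long as the largest FFT size.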

Abstract

Emerging wireless AR/VR applications require real-time transmission of correlated high-fidelity speech from multiple resource-constrained devices over unreliable, bandwidth-limited channels. Existing autoencoder-based speech source coding methods do not jointly address three requirements: (1) dynamic bitrate adaptation without retraining the model, (2) leveraging correlations among multiple speech sources, and (3) balancing downstream task loss with realism of reconstructed speech. We propose a neural distributed principal component analysis (NDPCA)-aided distributed source coding algorithm for correlated speech sources transmitting to a central receiver. Our method includes a perception-aware downstream task loss function that balances perceptual realism with task-specific performance. Experiments show significant PSNR improvements under bandwidth constraints over naive autoencoder methods in task-agnostic (19%) and task-aware (52%) settings. The method also approaches the theoretical upper bound, in which all correlated sources are sent to a single encoder, especially in low-bandwidth scenarios. Additionally, we present a rate-distortion-perception trade-off curve, enabling adaptive decisions based on application-specific realism needs.
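To make the idea of bitrate adaptation without retraining more concrete, the sketch below shows one way a PCA-based module could split a total latent-dimension budget across several sources. It is a simplified illustration under assumptions, not the paper's NDPCA module: it runs ordinary PCA on fixed latent vectors instead of on the outputs of trained neural encoders, and the function names and the greedy "largest eigenvalues first" allocation rule are hypothetical choices for this example.

```python
import numpy as np


def fit_source_pca(latents):
    """Return (mean, eigenvalues, eigenvectors) of one source's latents, sorted by descending variance."""
    mean = latents.mean(axis=0)
    centered = latents - mean
    cov = centered.T @ centered / (len(latents) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return mean, eigvals[order], eigvecs[:, order]


def allocate_dimensions(eigvals_per_source, total_dims):
    """Give the budget to the globally largest eigenvalues and count how many each source receives."""
    tagged = [
        (value, source)
        for source, values in enumerate(eigvals_per_source)
        for value in values
    ]
    tagged.sort(key=lambda pair: pair[0], reverse=True)
    counts = [0] * len(eigvals_per_source)
    for _, source in tagged[:total_dims]:
        counts[source] += 1
    return counts


def compress(latents, mean, eigvecs, k):
    """Project centered latents onto the top-k principal directions."""
    return (latents - mean) @ eigvecs[:, :k]


def reconstruct(codes, mean, eigvecs):
    """Map k-dimensional codes back to the latent space."""
    k = codes.shape[1]
    return codes @ eigvecs[:, :k].T + mean
```

Because the principal directions are fitted once, a change in available bandwidth only changes `total_dims` and therefore the per-source counts; nothing has to be refitted. A possible usage pattern is to call `fit_source_pca` per source, pass the eigenvalue arrays to `allocate_dimensions`, and then call `compress` on each source with its allotted `k`.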

Paper Structure

This paper contains 39 sections, 18 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Proposed end-to-end pipeline for distributed downstream speech enhancement using perceptual loss.
  • Figure 2: Proposed task-aware speech enhancement using perceptual loss.
  • Figure 3: (a) Efficient source coding using CSI-aware feedback vs. channel capacity, (b) Dimensions of latent space vs. task-agnostic PSNR [dB] for baseline E1D1, E2D1 and E4D1 autoencoders, (c) Dimensions of latent space vs. task-aware PSNR [dB] for baseline E1D1, E2D1 and E4D1 autoencoders.
  • Figure 4: Experimental distortion versus total bandwidth under varying relative weights given to perception loss.
  • Figure 5: Floor plan for correlated audio data retrieval for distributed task-aware source coding [barker18_interspeech].
  • ...and 3 more figures