Task and Perception-aware Distributed Source Coding for Correlated Speech under Bandwidth-constrained Channels
Sagnik Bhattacharya, Muhammad Ahmed Mohsin, Ahsan Bilal, John M. Cioffi
TL;DR
This work tackles real-time transmission of correlated speech from multiple edge devices over bandwidth-constrained wireless channels. It introduces a CSI-aware, NDPCA-aided distributed autoencoder trained with a perception-aware downstream task loss, augmented by a score-based diffusion model for downstream speech enhancement and an MS-STFT-based perceptual loss. Compared with naive autoencoder baselines, the approach improves PSNR by 19% in task-agnostic and 52% in task-aware settings, and it approaches the single-encoder upper bound, particularly at low bandwidth. It also provides a rate-distortion-perception curve for choosing the level of realism an application needs. Together, these results improve compression efficiency, perceptual realism, and task performance in wireless AR/VR scenarios, with dynamic bandwidth adaptation that requires no retraining and exploits inter-source speech correlations.
Abstract
Emerging wireless AR/VR applications require real-time transmission of correlated high-fidelity speech from multiple resource-constrained devices over unreliable, bandwidth-limited channels. Existing autoencoder-based speech source coding methods fail to jointly address three requirements: (1) dynamic bitrate adaptation without retraining the model, (2) leveraging correlations among multiple speech sources, and (3) balancing downstream task loss with the realism of the reconstructed speech. We propose a neural distributed principal component analysis (NDPCA)-aided distributed source coding algorithm for correlated speech sources transmitting to a central receiver. Our method includes a perception-aware downstream task loss function that balances perceptual realism with task-specific performance. Experiments show significant PSNR improvements under bandwidth constraints over naive autoencoder methods in both task-agnostic (19%) and task-aware (52%) settings. The method also approaches the theoretical upper bound, in which all correlated sources are sent to a single encoder, especially in low-bandwidth scenarios. Additionally, we present a rate-distortion-perception trade-off curve, enabling adaptive decisions based on application-specific realism needs.
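To make the idea of bitrate adaptation without retraining concrete, the following is a minimal sketch and not the authors' implementation. It assumes each source already has a latent representation (here, synthetic NumPy arrays standing in for encoder outputs), fits an ordinary PCA per source, and splits a total budget of latent dimensions across sources by greedily keeping the highest-variance components. The function names (`fit_local_pca`, `allocate_bandwidth`, `mse_at_bandwidth`) and the greedy rule are illustrative assumptions; the neural encoders and decoders, the channel model, and the perception-aware loss are omitted.

```python
import numpy as np


def fit_local_pca(latents):
    """Fit PCA on one source's latents; return (mean, components, variances)."""
    mean = latents.mean(axis=0)
    centered = latents - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (len(latents) - 1)
    return mean, vt, variances


def allocate_bandwidth(variances_per_source, total_dims):
    """Greedy split (an assumption of this sketch): keep the total_dims
    highest-variance components across all sources and count them per source."""
    candidates = [
        (var, src)
        for src, variances in enumerate(variances_per_source)
        for var in variances
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    counts = [0] * len(variances_per_source)
    for _, src in candidates[:total_dims]:
        counts[src] += 1
    return counts


def project(latents, mean, components, k):
    """Compress latents to the first k principal coordinates."""
    return (latents - mean) @ components[:k].T


def reconstruct(codes, mean, components, k):
    """Map k principal coordinates back to the latent space."""
    return codes @ components[:k] + mean


def mse_at_bandwidth(sources, total_dims):
    """Average reconstruction MSE over sources for a given dimension budget."""
    models = [fit_local_pca(x) for x in sources]
    counts = allocate_bandwidth([m[2] for m in models], total_dims)
    errors = []
    for x, (mean, comps, _), k in zip(sources, models, counts):
        x_hat = reconstruct(project(x, mean, comps, k), mean, comps, k)
        errors.append(np.mean((x - x_hat) ** 2))
    return float(np.mean(errors))


# Toy data: two sources driven by a shared signal, with different energy.
rng = np.random.default_rng(0)
shared = rng.normal(size=(500, 4))
source_a = shared @ rng.normal(size=(4, 8)) + 0.05 * rng.normal(size=(500, 8))
source_b = 0.3 * (shared @ rng.normal(size=(4, 8))) + 0.05 * rng.normal(size=(500, 8))

for budget in (2, 4, 8, 16):
    print(budget, mse_at_bandwidth([source_a, source_b], budget))
```

Because the projections are fit once and only the number of retained components changes, the same fitted model serves every budget. In this toy version the per-source PCA only decides how many dimensions each source gets; exploiting correlation between sources is left to the learned encoders described in the abstract.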
