Robust Multi-modal Task-oriented Communications with Redundancy-aware Representations
Jingwen Fu, Ming Xiao, Zhonghao Lyu, Mikael Skoglund, Celimuge Wu
TL;DR
This work addresses robust multi-modal task-oriented communications (TOC) by jointly compressing modality-specific information and reducing inter-modal redundancy. It introduces a two-stage variational information bottleneck (VIB) framework: Stage I applies a uni-modal VIB to each modality, and Stage II fuses the resulting representations and applies a multi-modal VIB (M-VIB) to improve robustness to channel distortion. Cross-modal redundancy is minimized with a mutual information (MI) discriminator trained through a gradient reversal layer, which pushes the modality representations toward independence via a variational JS-divergence bound. Experiments on MOSI and MOSEI show superior task performance and robustness under AWGN and Rayleigh channels. The MI minimization is validated by discriminator outputs that indicate near-independent representations and by improved accuracy in low-SNR regimes. Overall, the framework provides a principled approach to jointly optimizing modality-specific compression, inter-modal redundancy suppression, and communication reliability in semantic multi-modal TOC systems.
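To make the JS-divergence idea above concrete, the following is a minimal sketch of a JS-style MI estimate computed from discriminator logits. It uses the generic estimator form known from the f-GAN and Deep InfoMax literature, which may differ from the exact bound used in this work. The function names and the toy logits are assumptions for illustration only. In training, a gradient reversal layer would let the discriminator maximize this quantity while the encoders minimize it; that step is not shown here.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def jsd_mi_estimate(joint_scores, marginal_scores):
    """JS-divergence-style MI estimate from discriminator logits.

    joint_scores:    logits T(z_a, z_b) on paired samples (same input).
    marginal_scores: logits T(z_a, z_b') on shuffled pairs (different inputs).
    Assumed generic form: E_joint[-softplus(-T)] - E_marginal[softplus(T)].
    """
    pos = sum(-softplus(-t) for t in joint_scores) / len(joint_scores)
    neg = sum(softplus(t) for t in marginal_scores) / len(marginal_scores)
    return pos - neg


# A discriminator that cannot tell paired from shuffled samples outputs
# logits near 0, which gives the floor value -2 * log(2).
floor = jsd_mi_estimate([0.0, 0.0], [0.0, 0.0])

# A confident discriminator (toy logits) yields a larger estimate, which
# signals remaining statistical dependence between the two representations.
confident = jsd_mi_estimate([5.0, 6.0], [-5.0, -6.0])

print(f"floor={floor:.4f}, confident={confident:.4f}")
```

Under this assumed form, an estimate close to the floor suggests the discriminator has no usable signal, which is the behaviour one would expect when the modality representations are close to independent.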
Abstract
Semantic communications for multi-modal data can transmit task-relevant information efficiently over noisy and bandwidth-limited channels. However, a key challenge is to simultaneously reduce inter-modal redundancy and improve semantic reliability under channel distortion. To address this challenge, we propose a robust and efficient multi-modal task-oriented communication framework that integrates a two-stage variational information bottleneck (VIB) with mutual information (MI) redundancy minimization. In the first stage, we apply a uni-modal VIB to compress each modality (text, audio, and video) separately while preserving task-specific features. To enhance efficiency, an MI minimization module with adversarial training is then used to suppress cross-modal dependencies and to promote complementarity rather than redundancy. In the second stage, a multi-modal VIB further compresses the fused representation and enhances robustness against channel distortion. Experimental results on multi-modal emotion recognition tasks demonstrate that the proposed framework significantly outperforms existing baselines in accuracy and reliability, particularly in low signal-to-noise ratio (SNR) regimes. Our work provides a principled framework that jointly optimizes modality-specific compression, inter-modal redundancy suppression, and communication reliability.
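As a rough illustration of the compression and channel steps described above, the sketch below computes the usual VIB rate term (the KL divergence between a diagonal Gaussian encoder output and a standard normal prior), samples a latent with the reparameterization trick, and passes it through an AWGN channel at a chosen SNR. The Gaussian encoder with a standard normal prior is the common VIB choice and is an assumption here, not a confirmed detail of this framework. All function names, the toy values of `mu` and `log_var`, and the weight `beta` are hypothetical. The Rayleigh case is not sketched.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), a common VIB rate term."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def awgn_channel(z, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power matches snr_db."""
    signal_power = sum(v * v for v in z) / len(z)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [v + noise_std * rng.gauss(0.0, 1.0) for v in z]


def vib_loss(task_loss, mu, log_var, beta):
    """Task loss plus beta-weighted rate term (generic VIB objective)."""
    return task_loss + beta * kl_to_standard_normal(mu, log_var)


rng = random.Random(0)                # fixed seed for repeatability
mu = [0.5, -1.0, 0.2]                 # toy encoder mean (hypothetical)
log_var = [-0.1, 0.0, -0.3]           # toy encoder log-variance (hypothetical)

z = reparameterize(mu, log_var, rng)              # latent to transmit
z_received = awgn_channel(z, snr_db=0.0, rng=rng) # noisy latent at the receiver
rate = kl_to_standard_normal(mu, log_var)         # compression penalty

print("rate term:", round(rate, 4))
print("received latent:", [round(v, 3) for v in z_received])
```

In a two-stage setup such as the one described, a term of this kind would appear once per modality in the first stage and once for the fused representation in the second stage, with the channel applied to the representation that is actually transmitted.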
