Robust Multi-modal Task-oriented Communications with Redundancy-aware Representations

Jingwen Fu, Ming Xiao, Zhonghao Lyu, Mikael Skoglund, Celimuge Wu

TL;DR

This work tackles robust multi-modal task-oriented communications by jointly compressing modality-specific information and reducing inter-modal redundancy. It introduces a two-stage variational information bottleneck (VIB) framework: Stage I applies uni-modal VIB to each modality, and Stage II fuses these representations and applies a multi-modal VIB (M-VIB) to enhance robustness to channel distortions. Cross-modal redundancy minimization is achieved via a mutual-information discriminator with a gradient reversal layer, enforcing independence among modalities through a variational JS-divergence bound. Experiments on MOSI and MOSEI demonstrate superior task performance and robustness under AWGN and Rayleigh channels, with MI minimization validated by discriminator outputs approaching independence and by improved accuracy in low-SNR regimes. Overall, the framework provides a principled approach to jointly optimize modality-specific compression, inter-modal redundancy suppression, and communication reliability in semantic multi-modal TOC systems.
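The compression term used in each VIB stage can be illustrated with a minimal NumPy sketch. It assumes a diagonal-Gaussian encoder and a standard-normal prior, which is the common VIB parameterization but is not confirmed by this summary; the names `reparameterize`, `kl_to_standard_normal`, `vib_loss`, and the weight `beta` are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def vib_loss(task_loss, mu, log_var, beta=1e-3):
    """Task loss plus a beta-weighted compression penalty (beta is an assumed value)."""
    return task_loss + beta * np.mean(kl_to_standard_normal(mu, log_var))

# Toy usage: a batch of 8 latent codes of dimension 4.
rng = np.random.default_rng(0)
mu = np.ones((8, 4))
log_var = np.zeros((8, 4))
z = reparameterize(mu, log_var, rng)
loss = vib_loss(task_loss=1.0, mu=mu, log_var=log_var, beta=0.1)
```

Under these assumptions, a larger `beta` pushes the encoder toward the prior and so trades task accuracy for a more compact representation.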

Abstract

Semantic communications for multi-modal data can transmit task-relevant information efficiently over noisy and bandwidth-limited channels. However, a key challenge is to simultaneously compress inter-modal redundancy and improve semantic reliability under channel distortion. To address the challenge, we propose a robust and efficient multi-modal task-oriented communication framework that integrates a two-stage variational information bottleneck (VIB) with mutual information (MI) redundancy minimization. In the first stage, we apply uni-modal VIB to compress each modality separately, i.e., text, audio, and video, while preserving task-specific features. To enhance efficiency, an MI minimization module with adversarial training is then used to suppress cross-modal dependencies and to promote complementarity rather than redundancy. In the second stage, a multi-modal VIB is further used to compress the fused representation and to enhance robustness against channel distortion. Experimental results on multi-modal emotion recognition tasks demonstrate that the proposed framework significantly outperforms existing baselines in accuracy and reliability, particularly under low signal-to-noise ratio regimes. Our work provides a principled framework that jointly optimizes modality-specific compression, inter-modal redundancy, and communication reliability.
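The channel distortions mentioned above (AWGN and Rayleigh fading, as named in the TL;DR) can be simulated roughly as follows. This is a sketch under stated assumptions: complex baseband symbols, flat fading with unit average power, and perfect channel knowledge with zero-forcing equalization at the receiver. None of these details are given in this summary, and the function names are hypothetical.

```python
import numpy as np

def awgn_channel(x, snr_db, rng):
    """Add complex Gaussian noise so that signal power / noise power equals the SNR."""
    signal_power = np.mean(np.abs(x) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.sqrt(noise_power / 2.0) * (
        rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    )
    return x + noise

def rayleigh_channel(x, snr_db, rng):
    """Flat Rayleigh fading followed by AWGN, then zero-forcing equalization (assumed)."""
    # h has unit average power, so the faded signal keeps roughly the power of x.
    h = (rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)) / np.sqrt(2.0)
    y = awgn_channel(h * x, snr_db, rng)
    return y / h  # deep fades (small |h|) amplify the noise

# Toy usage: send 100000 unit-power symbols at 10 dB SNR.
rng = np.random.default_rng(0)
x = np.ones(100_000, dtype=complex)
y_awgn = awgn_channel(x, snr_db=10.0, rng=rng)
y_ray = rayleigh_channel(x, snr_db=10.0, rng=rng)
measured_noise_power = np.mean(np.abs(y_awgn - x) ** 2)
```

At the same SNR, the Rayleigh output is noisier on average than the AWGN output because equalization amplifies noise during deep fades, which is one reason robustness matters most in low-SNR regimes.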

Paper Structure

This paper contains 37 sections, 1 theorem, 56 equations, 9 figures, 1 table, 1 algorithm.

Key Result

Proposition 1

For the optimal discriminator $T_{\mathrm{it}}$, the objective (loss function) in eq:logsigmoid_mi satisfies $0 \le \mathcal{J}_{\log\sigma}(Z^i;Z^t) \le 2\log 2$, with $\mathcal{J}_{\log\sigma}(Z^i;Z^t)=0$ iff $p_{z^i,z^t}=p_{z^i}p_{z^t}$, while the upper bound $2\log 2$ is attained in the limit of perfectly separable distributions $p_{z^i,z^t}$ and $p_{z^i}p_{z^t}$.
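The two extremes in Proposition 1 can be checked numerically on small discrete distributions. The sketch below assumes the objective has the standard GAN-style log-sigmoid form $2\log 2 + \mathbb{E}_{p_{z^i,z^t}}[\log\sigma(T)] + \mathbb{E}_{p_{z^i}p_{z^t}}[\log\sigma(-T)]$, whose optimal discriminator is the log density ratio. The exact form of eq:logsigmoid_mi is not reproduced in this summary, so this form and the helper names are assumptions.

```python
import numpy as np

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    return -np.logaddexp(0.0, -t)

def js_objective_at_optimum(p_joint):
    """Assumed objective evaluated at T* = log(joint / product of marginals)."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_prod = np.outer(p_joint.sum(axis=1), p_joint.sum(axis=0))
    eps = 1e-300  # keeps log() finite where the joint has zero mass
    t_star = np.log(p_joint + eps) - np.log(p_prod + eps)
    return (
        2.0 * np.log(2.0)
        + np.sum(p_joint * log_sigmoid(t_star))
        + np.sum(p_prod * log_sigmoid(-t_star))
    )

# Independent variables: the joint equals the product of its marginals.
independent = np.outer([0.5, 0.5], [0.5, 0.5])
# Strongly dependent variables: all mass on the diagonal of a 200 x 200 table.
dependent = np.eye(200) / 200.0

j_independent = js_objective_at_optimum(independent)
j_dependent = js_objective_at_optimum(dependent)
upper_bound = 2.0 * np.log(2.0)
```

Under this assumed form, `j_independent` is zero up to floating-point error, while `j_dependent` moves toward `upper_bound` as the diagonal table grows. It never reaches the bound exactly, which matches the "in the limit" wording of the proposition.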

Figures (9)

  • Figure 1: The proposed framework for multi-modal TOC.
  • Figure 2: Uni-modal VIB.
  • Figure 3: Multi-modal VIB.
  • Figure 4: Redundancy among different modalities.
  • Figure 5: System performance on the MOSEI dataset under AWGN and Rayleigh fading channels.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Proposition 1