E2E Learning Massive MIMO for Multimodal Semantic Non-Orthogonal Transmission and Fusion

Minghui Wu; Zhen Gao

Paper

E2E Learning Massive MIMO for Multimodal Semantic Non-Orthogonal Transmission and Fusion

Abstract

This paper investigates multimodal semantic non-orthogonal transmission and fusion in hybrid analog-digital massive multiple-input multiple-output (MIMO). A Transformer-based cross-modal source-channel semantic-aware network (CSC-SA-Net) framework is conceived, where channel state information (CSI) reference signal (RS), feedback, analog-beamforming/combining, and baseband semantic processing are data-driven end-to-end (E2E) optimized at the base station (BS) and user equipments (UEs). CSC-SA-Net comprises five sub-networks: BS-side CSI-RS network (BS-CSIRS-Net), UE-side channel semantic-aware network (UE-CSANet), BS-CSANet, UE-side multimodal semantic fusion network (UE-MSFNet), and BS-MSFNet. Specifically, we firstly E2E train BS-CSIRS-Net, UE-CSANet, and BS-CSANet to jointly design CSI-RS, feedback, analog-beamforming/combining with maximum {\emph{physical-layer's}} spectral-efficiency. Meanwhile, we E2E train UE-MSFNet and BS-MSFNet for optimizing {\emph{application-layer's}} source semantic downstream tasks. On these pre-trained models, we further integrate application-layer semantic processing with physical-layer tasks to E2E train five subnetworks. Extensive simulations show that the proposed CSC-SA-Net outperforms traditional separated designs, revealing the advantage of cross-modal channel-source semantic fusion.