JEPA-MSAC: A Joint-Embedding Predictive Architecture for Multimodal Sensing-Assisted Communications

Can Zheng, Jiguang He, Guofa Cai, Nannan Li, Mehdi Bennis, Henk Wymeersch, Merouane Debbah

Abstract

Future wireless systems increasingly require predictive and transferable representations that can support multiple physical-layer (PHY) tasks in dynamic environments. However, most existing supervised learning-based methods are designed for a single task, which leads to a high adaptation cost. To address this issue, we propose a joint-embedding predictive architecture for multimodal sensing-assisted communications (JEPA-MSAC), a self-supervised multimodal predictive representation learning framework for wireless environments. The proposed framework first maps multimodal sensing and communication measurements into a unified token space, and then pretrains a shared backbone with a temporal block-masked JEPA objective to learn a predictive latent space that captures environment dynamics and cross-modal dependencies. After pretraining, the backbone is frozen and reused as a general future-feature generator, on top of which lightweight task heads are trained for localization, beam prediction, and received signal strength indicator (RSSI) prediction. Extensive experiments show that the learned latent state supports accurate multi-task prediction at low adaptation cost. Ablation studies further reveal the framework's scaling behavior and the impact of key pretraining choices.
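To make the pretraining stage concrete, the following is a minimal PyTorch sketch of temporal block-masked JEPA pretraining. It is an illustration under assumed shapes and module choices (a small transformer backbone, an MLP predictor, an EMA-updated target encoder, a single contiguous masked block of time steps), not the authors' implementation; the tokenizers that map multimodal measurements into the token sequence are omitted.

```python
# Hedged sketch of temporal block-masked JEPA pretraining (all names and
# hyperparameters are illustrative assumptions, not the paper's exact design).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone(nn.Module):
    """Stand-in transformer encoder over a sequence of multimodal tokens."""
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):                    # tokens: (B, T, dim)
        return self.encoder(tokens)

def temporal_block_mask(T, block_len=8, device="cpu"):
    """Mask one contiguous temporal block (assumes T >= block_len)."""
    start = torch.randint(0, T - block_len + 1, (1,)).item()
    mask = torch.zeros(T, dtype=torch.bool, device=device)
    mask[start:start + block_len] = True
    return mask

context_enc = Backbone()
target_enc = copy.deepcopy(context_enc)           # EMA target, no gradients
for p in target_enc.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
opt = torch.optim.AdamW(
    list(context_enc.parameters()) + list(predictor.parameters()), lr=1e-4)

def pretrain_step(tokens, ema=0.996):
    """One JEPA step: predict target-encoder features of the masked steps."""
    mask = temporal_block_mask(tokens.shape[1], device=tokens.device)
    # Simplification: zero-fill the masked block instead of dropping tokens.
    visible = tokens.masked_fill(mask[None, :, None], 0.0)
    z_ctx = context_enc(visible)                  # features from the context
    with torch.no_grad():
        z_tgt = target_enc(tokens)                # targets from the full sequence
    pred = predictor(z_ctx[:, mask])              # predict masked-step features
    loss = F.smooth_l1_loss(pred, z_tgt[:, mask])
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                         # EMA update of the target
        for p_t, p_c in zip(target_enc.parameters(), context_enc.parameters()):
            p_t.lerp_(p_c, 1.0 - ema)
    return loss.item()
```

Note that the loss is computed in latent space rather than on raw signals; the stop-gradient target encoder and its EMA update are the standard JEPA-style ingredients that discourage representation collapse.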

Paper Structure

This paper contains 36 sections, 43 equations, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: System architecture of sensing-assisted mmWave V2I communications. JEPA-MSAC processes multimodal sensing and communication data to perform specific PHY tasks, including localization, beam prediction, and RSSI prediction.
  • Figure 2: Overall framework of the proposed JEPA-MSAC. JEPA-MSAC tokenizes multimodal observations, pretrains a JEPA backbone, and adapts frozen features to localization, beam prediction, and RSSI prediction. Modules marked with a spark icon are learnable.
  • Figure 3: Overview of task-specific prediction heads on top of the predictive latent representation learned by JEPA-MSAC. The framework supports localization, beam prediction, and RSSI prediction through lightweight heads with optional localization-guided feature fusion. Residual and direct prediction modes are employed depending on the availability of historical observations (a minimal sketch of such heads follows this list).
  • Figure 4: CDFs of the localization displacement errors for all compared methods.
  • Figure 5: Mean displacement error versus prediction horizon for different prediction methods.
  • ...and 5 more figures
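As a companion to the pretraining sketch above, the following illustrates the adaptation stage described in Figure 3: the pretrained backbone is frozen and used as a future-feature generator, and only lightweight heads are trained for localization, beam prediction, and RSSI prediction. The head widths, the beam-codebook size, the use of the final latent step, and the residual-mode offset are assumptions for illustration; `context_enc` is reused from the sketch above.

```python
# Hedged sketch of lightweight task heads on the frozen JEPA backbone,
# continuing the pretraining sketch above (head design is assumed).
class TaskHeads(nn.Module):
    def __init__(self, dim=256, num_beams=64):
        super().__init__()
        self.loc_head = nn.Linear(dim, 2)            # (x, y) position regression
        self.beam_head = nn.Linear(dim, num_beams)   # beam-codebook classification
        self.rssi_head = nn.Linear(dim, 1)           # scalar RSSI regression

    def forward(self, z_future, last_obs_pos=None):
        loc = self.loc_head(z_future)
        if last_obs_pos is not None:                 # residual mode: predict an
            loc = loc + last_obs_pos                 # offset from the last
        return loc, self.beam_head(z_future), self.rssi_head(z_future)  # observed position

heads = TaskHeads()
head_opt = torch.optim.AdamW(heads.parameters(), lr=1e-3)

def adapt_step(tokens, pos_gt, beam_gt, rssi_gt, last_obs_pos=None):
    """Train only the heads; the pretrained backbone stays frozen."""
    with torch.no_grad():                            # frozen future-feature generator
        z = context_enc(tokens)[:, -1]               # latent at the horizon step
    loc, beam_logits, rssi = heads(z, last_obs_pos)
    loss = (F.mse_loss(loc, pos_gt)
            + F.cross_entropy(beam_logits, beam_gt)  # beam_gt: class indices
            + F.mse_loss(rssi.squeeze(-1), rssi_gt))
    head_opt.zero_grad(); loss.backward(); head_opt.step()
    return loss.item()
```

Because gradients stop at the frozen backbone, adaptation updates only the small head parameters, which is what keeps the per-task adaptation cost low.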