Multimodal Wireless Foundation Models

Ahmed Aboulfotouh, Hatem Abou-Zeid

TL;DR

The paper addresses the limitation of single-modality WFMs by introducing a multimodal wireless foundation model that jointly processes raw IQ streams and image-like modalities such as spectrograms and CSI. It proposes a ViT-based masked autoencoder with modality-specific embeddings and a shared encoder, trained via a self-supervised masked wireless modeling objective on unlabeled spectrogram and IQ data. The model is competitive with single-modality WFMs across five downstream tasks and benefits from LoRA fine-tuning, indicating cross-modality transfer and potential for broader AI-native 6G capabilities. Overall, this work is a concrete step toward joint sensing, communication, and localization with a single multimodal backbone that can be adapted efficiently.
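The summary above mentions LoRA fine-tuning. As a rough illustration of the general technique (not the authors' implementation), the sketch below adds a trainable low-rank update to a frozen weight matrix. The dimensions, the rank `r`, the scaling `alpha`, and the name `lora_forward` are all assumptions chosen for the example.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16.0):
    """Frozen linear layer W plus a low-rank update scaled by alpha / r."""
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4               # hypothetical sizes
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, r))                 # trainable, zero init

x = rng.normal(size=(8, d_in))
y = lora_forward(x, W, A, B)

trainable = A.size + B.size              # parameters updated during fine-tuning
total = W.size                           # parameters left frozen
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained layer exactly, and only `A` and `B` would be updated during fine-tuning.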

Abstract

Wireless foundation models (WFMs) have recently demonstrated promising capabilities, jointly performing multiple wireless functions and adapting effectively to new environments. However, current WFMs process only one modality, even though the most informative modality changes with the task and operating conditions, and no single modality is best for all tasks. WFMs should therefore be designed to accept multiple modalities to enable a broader and more diverse range of tasks and scenarios. In this work, we propose and build the first multimodal wireless foundation model capable of processing both raw IQ streams and image-like wireless modalities (e.g., spectrograms and CSI) and performing multiple tasks across both. We introduce masked wireless modeling for the multimodal setting, a self-supervised objective and pretraining recipe that learns a joint representation from IQ streams and image-like wireless modalities. We evaluate the model on five tasks across both modality families: image-based (human activity sensing, RF signal classification, 5G NR positioning) and IQ-based (RF device fingerprinting, interference detection/classification). The multimodal WFM is competitive with single-modality WFMs, and in several cases surpasses their performance. Our results demonstrate the strong potential of developing multimodal WFMs that support diverse wireless tasks across different modalities. We believe this provides a concrete step toward both AI-native 6G and the vision of joint sensing, communication, and localization.
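To make the idea of masked modeling over two modality families more concrete, here is a minimal NumPy sketch, assuming a generic masked-autoencoder setup rather than the authors' exact recipe. An image-like input is split into square patches, an IQ stream into fixed-length segments, each modality is projected to a common token width by its own matrix, and a random subset of tokens is hidden from the encoder. The patch size, segment length, token width `D`, the 0.75 masking ratio, and all function names are hypothetical.

```python
import numpy as np

def patchify_image(img, p):
    """Split an (H, W) array into non-overlapping p x p patches, flattened."""
    H, W = img.shape
    assert H % p == 0 and W % p == 0
    x = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return x.reshape(-1, p * p)

def patchify_iq(iq, seg):
    """Split a (2, T) IQ array into non-overlapping segments of length seg."""
    C, T = iq.shape
    assert T % seg == 0
    x = iq.reshape(C, T // seg, seg).transpose(1, 0, 2)
    return x.reshape(-1, C * seg)

def random_mask(n_tokens, ratio, rng):
    """Return sorted indices of visible and masked tokens."""
    n_keep = int(round(n_tokens * (1 - ratio)))
    perm = rng.permutation(n_tokens)
    return np.sort(perm[:n_keep]), np.sort(perm[n_keep:])

rng = np.random.default_rng(0)
D = 32                                    # shared token width (assumed)
spec = rng.normal(size=(64, 64))          # stand-in for a spectrogram
iq = rng.normal(size=(2, 1024))           # stand-in for an IQ stream (I and Q rows)

# Modality-specific linear embeddings into the shared token space.
spec_tokens = patchify_image(spec, 16) @ rng.normal(size=(16 * 16, D))
iq_tokens = patchify_iq(iq, 64) @ rng.normal(size=(2 * 64, D))

# Hide 75% of the spectrogram tokens; only the visible ones reach the encoder.
keep, masked = random_mask(len(spec_tokens), 0.75, rng)
visible = spec_tokens[keep]
```

In a typical masked autoencoder, a shared encoder would consume `visible` and a lightweight decoder would be trained to reconstruct the patches at the `masked` positions; those components are omitted here.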

Paper Structure

This paper contains 12 sections, 8 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The Proposed Multimodal Wireless Foundation Model.
  • Figure 2: Pretraining Datasets.
  • Figure 3: Reconstruction examples at different masking ratios.