A Multi-Modal Foundational Model for Wireless Communication and Sensing

Vahid Yazdnian; Yasaman Ghasempour

TL;DR

The paper tackles the limited generalization and high data demands of AI in wireless systems by introducing a task-agnostic, multi-modal foundation model that learns physics-aware representations across channel, geometry, and location modalities. It uses a physics-guided self-supervised pretraining regime with a dedicated [PHYSC] token to capture EM propagation correlations, and demonstrates transfer to localization and MIMO hybrid precoding with limited labeled data. A large synthetic pretraining dataset (about 3.5 million samples across about 10,000 outdoor urban scenes at 28.5 GHz) enables robust cross-scene generalization and data efficiency, outperforming task-specific baselines in several metrics. The results suggest that physics-informed, cross-modal representations can substantially improve both localization accuracy and wireless throughput while reducing the need for extensive on-site data collection, signaling a practical path toward universal wireless intelligence.

Abstract

Artificial intelligence is a key enabler for next-generation wireless communication and sensing. Yet, today's learning-based wireless techniques do not generalize well: most models are task-specific, environment-dependent, and limited to narrow sensing modalities, requiring costly retraining when deployed in new scenarios. This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems that learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our framework employs a physics-guided self-supervised pretraining strategy incorporating a dedicated physical token to capture cross-modal physical correspondences governed by electromagnetic propagation. The learned representations enable efficient adaptation to diverse downstream tasks, including massive multi-antenna optimization, wireless channel estimation, and device localization, using limited labeled data. Our extensive evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.

Paper Structure (36 sections, 29 equations, 15 figures, 1 table)

Introduction
Related Work
Multi-Modal Foundational Model
Preliminaries and Fundamental Design Decisions
Foundational Model
Physics-Informed Self-Supervised Pretraining
Pretraining Dataset
Interpreting Attention Patterns in the Pretrained Foundational Model Through a Physical Lens
Evaluation of Pretrained Foundational Model to Downstream Applications
Downstream Tasks
Results and Key Takeaways
Discussion
Details on Backbone Model Architecture
Pretraining with Physics Informed Self-Supervised Learning
Masked CSI Reconstruction Loss
...and 21 more sections

Figures (15)

Figure 1: Foundational models serving application layer Artificial Neural Networks (ANNs) in wireless communication and sensing.
Figure 2: Overview of the proposed multi-modal foundation model architecture and its physics-guided end-to-end learning pipeline.
Figure 3: $[\mathrm{PHYSC}]$ token attention score on scene patches reveals learned meaningful patterns that reflect EM wave propagation characteristics when interacting with scene components, including LOS/NLOS and reflection effects.
Figure 4: Illustration of spatial spectrum.
Figure 5: Results on adapting the foundational model to wireless localization: (a) t-SNE visualization of raw CSI and the encoded $[\mathrm{PHYSC}]$ token; localization error vs the number of training samples in (b) scene-specific, and (c) cross-scene generalization settings.
...and 10 more figures
