A Multi-Modal Foundational Model for Wireless Communication and Sensing
Vahid Yazdnian, Yasaman Ghasempour
TL;DR
The paper tackles the limited generalization and high data demands of AI in wireless systems by introducing a task-agnostic, multi-modal foundational model that learns physics-aware representations across channel, geometry, and location modalities. It uses a physics-guided self-supervised pretraining regime with a dedicated [PHYSC] token to capture cross-modal correlations governed by electromagnetic (EM) propagation, and demonstrates transfer to localization and MIMO hybrid precoding with limited labeled data. A large synthetic pretraining dataset (≈3.5 million samples across ≈10,000 outdoor urban scenes at 28.5 GHz) enables robust cross-scene generalization and data efficiency, and the model outperforms task-specific baselines on several metrics. The results suggest that physics-informed, cross-modal representations can substantially improve both localization accuracy and wireless throughput while reducing the need for extensive on-site data collection, pointing to a practical path toward universal wireless intelligence.
Abstract
Artificial intelligence is a key enabler for next-generation wireless communication and sensing. Yet, today's learning-based wireless techniques do not generalize well: most models are task-specific, environment-dependent, and limited to narrow sensing modalities, requiring costly retraining when deployed in new scenarios. This work introduces a task-agnostic, multi-modal foundational model for physical-layer wireless systems that learns transferable, physics-aware representations across heterogeneous modalities, enabling robust generalization across tasks and environments. Our framework employs a physics-guided self-supervised pretraining strategy incorporating a dedicated physical token to capture cross-modal physical correspondences governed by electromagnetic propagation. The learned representations enable efficient adaptation to diverse downstream tasks, including massive multi-antenna optimization, wireless channel estimation, and device localization, using limited labeled data. Our extensive evaluations demonstrate superior generalization, robustness to deployment shifts, and reduced data requirements compared to task-specific baselines.
