MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing
Zhizhen Li, Xuanhao Luo, Xueren Ge, Longyu Zhou, Xingqin Lin, Yuchen Liu
TL;DR
Problem: current wireless sensing methods are often task- or modality-specific, limiting generalization. MMSense presents a unified, multi-modal, multi-task foundation model that fuses image, radar, LiDAR, and text into a vision-aligned embedding space, guided by modality gating and cross-modal attention within a Vision-LLM backbone. It introduces a task-specific multi-layer attention scheme and an uncertainty-based loss to balance learning across channel-centric, environment-aware, and human-centered sensing, with a joint objective $$\mathcal{L}_{\text{total}} = \sum_t \left( \frac{1}{2\sigma_t^2} \mathcal{L}_t + \log \sigma_t \right)$$ in which both the weighted task loss and the $$\log \sigma_t$$ regularizer are summed over tasks. The approach achieves cross-task generalization and state-of-the-art performance on real-world datasets, including zero-shot and few-shot transfer, highlighting its potential for robust, scalable wireless sensing in 6G ISAC systems.
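To make the joint objective concrete, the following is a minimal sketch of how such an uncertainty-weighted sum could be evaluated for a given set of per-task losses. The function name `uncertainty_weighted_loss` and the choice to parameterize each task by `log_sigma` (a common trick for numerical stability) are assumptions for illustration and are not taken from the paper; in a real training loop the `log_sigma` values would be learnable parameters rather than fixed floats.

```python
import math


def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses as sum_t (L_t / (2 * sigma_t^2) + log sigma_t).

    task_losses: list of per-task loss values L_t (floats).
    log_sigmas:  list of log(sigma_t) values, one per task (hypothetical
                 parameterization; the paper's exact form may differ).
    """
    if len(task_losses) != len(log_sigmas):
        raise ValueError("need exactly one log_sigma per task loss")

    total = 0.0
    for loss_t, s_t in zip(task_losses, log_sigmas):
        # With s_t = log(sigma_t): 1 / (2 * sigma_t^2) = 0.5 * exp(-2 * s_t)
        precision_weight = 0.5 * math.exp(-2.0 * s_t)
        # The + s_t term penalizes large sigma_t, so a task cannot be
        # ignored simply by inflating its uncertainty.
        total += precision_weight * loss_t + s_t
    return total


if __name__ == "__main__":
    # Three hypothetical tasks (e.g. channel, environment, human sensing).
    losses = [2.0, 4.0, 1.0]
    log_sigmas = [0.0, 0.5, -0.5]
    print(uncertainty_weighted_loss(losses, log_sigmas))
```

Under this form, a task with a large $$\sigma_t$$ contributes less of its raw loss to the total, while the $$\log \sigma_t$$ term keeps the uncertainties from growing without bound.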
Abstract
Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objectives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations, enabling effective cross-modal alignment within a unified feature space. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment and instruction-driven task adaptation. Furthermore, task-specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outperforms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.
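As a rough illustration of what adaptive fusion through modality gating can look like, the sketch below weights per-modality embeddings by a softmax over gate scores and sums them. This is a generic gating pattern, not the architecture described in the paper: the function names, the dictionary-based interface, and the assumption that gate scores arrive as plain numbers (rather than being produced by a learned network from the inputs) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def gated_fusion(embeddings, gate_scores):
    """Fuse same-sized modality embeddings with softmax gate weights.

    embeddings:  dict mapping modality name -> list of floats (all equal length).
    gate_scores: dict mapping modality name -> raw gate score (assumed given;
                 a trained model would predict these from the inputs).
    Returns (fused_vector, weights_by_modality).
    """
    names = sorted(embeddings)
    dims = {len(embeddings[name]) for name in names}
    if len(dims) != 1:
        raise ValueError("all modality embeddings must share one dimension")

    weights = softmax([gate_scores[name] for name in names])
    fused = [0.0] * dims.pop()
    for weight, name in zip(weights, names):
        for i, value in enumerate(embeddings[name]):
            # Each modality contributes in proportion to its gate weight.
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for three hypothetical modalities.
    emb = {"image": [1.0, 0.0], "radar": [0.0, 1.0], "lidar": [1.0, 1.0]}
    scores = {"image": 2.0, "radar": 0.5, "lidar": 0.5}
    fused, weights = gated_fusion(emb, scores)
    print(weights)
    print(fused)
```

Because the gate weights sum to one, a modality with a low score (for example, a degraded radar frame) is smoothly down-weighted rather than dropped outright.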
