MMSense: Adapting Vision-based Foundation Model for Multi-task Multi-modal Wireless Sensing

Zhizhen Li; Xuanhao Luo; Xueren Ge; Longyu Zhou; Xingqin Lin; Yuchen Liu

TL;DR

Current wireless sensing methods are often task- or modality-specific, which limits generalization. MMSense is a unified multi-modal, multi-task foundation model that maps image, radar, LiDAR, and text inputs into a vision-aligned embedding space, fused through modality gating and cross-modal attention within a vision-based LLM backbone. It introduces a task-specific multi-layer attention scheme and an uncertainty-based loss that balances learning across channel-centric, environment-aware, and human-centered sensing, with the joint objective $$\mathcal{L}_{\text{total}} = \sum_t \left( \frac{1}{2\sigma_t^2} \mathcal{L}_t + \log \sigma_t \right)$$, where $\mathcal{L}_t$ is the loss of task $t$ and $\sigma_t$ is the uncertainty parameter that weights it. The approach generalizes across tasks and reports state-of-the-art performance on real-world datasets, including zero-shot and few-shot transfer, which points to its potential for robust, scalable wireless sensing in 6G ISAC systems.
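The joint objective above can be illustrated with a minimal sketch. It assumes the $\log \sigma_t$ term belongs to each task's summand and that each task carries a scalar $\log \sigma_t$ parameter; the function name `uncertainty_weighted_loss` and the use of plain Python floats are illustrative choices, not taken from the paper.

```python
import math


def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Return sum_t [ L_t / (2 * sigma_t^2) + log(sigma_t) ].

    task_losses: per-task loss values L_t (hypothetical inputs).
    log_sigmas:  per-task log(sigma_t); storing the log keeps sigma_t
                 positive, which is an assumption of this sketch.
    """
    if len(task_losses) != len(log_sigmas):
        raise ValueError("need one log-sigma per task loss")
    total = 0.0
    for loss_t, log_sigma_t in zip(task_losses, log_sigmas):
        precision = math.exp(-2.0 * log_sigma_t)  # equals 1 / sigma_t^2
        total += 0.5 * precision * loss_t + log_sigma_t
    return total


# Two tasks with sigma_t = 1 (log sigma_t = 0): each loss is simply halved.
equal_weighting = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])

# Raising sigma for a task shrinks its loss term but adds a log(sigma) penalty.
down_weighted = uncertainty_weighted_loss([8.0], [math.log(2.0)])
```

A larger $\sigma_t$ reduces the contribution of task $t$, while the $\log \sigma_t$ term keeps the model from inflating every $\sigma_t$ to make the losses vanish.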

Abstract

Large AI models have been widely adopted in wireless communications for channel modeling, beamforming, and resource optimization. However, most existing efforts remain limited to single-modality inputs and channel-specific objectives, overlooking the broader potential of large foundation models for unified wireless sensing. To bridge this gap, we propose MMSense, a multi-modal, multi-task foundation model that jointly addresses channel-centric, environment-aware, and human-centered sensing. Our framework integrates image, radar, LiDAR, and textual data by transforming them into vision-compatible representations, enabling effective cross-modal alignment within a unified feature space. A modality gating mechanism adaptively fuses these representations, while a vision-based large language model backbone enables unified feature alignment and instruction-driven task adaptation. Furthermore, task-specific sequential attention and uncertainty-based loss weighting mechanisms enhance cross-task generalization. Experiments on real wireless scenario datasets show that our approach outperforms both task-specific and large-model baselines, confirming its strong generalization across heterogeneous sensing tasks.
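The abstract does not specify how the modality gating mechanism is built. The sketch below shows one common form of gated fusion, a softmax-weighted sum of per-modality embeddings, purely as an assumption. The names `softmax` and `gated_fusion`, the dictionary layout, and the gate scores passed in as plain numbers (in a trained model they would come from a learned network) are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def gated_fusion(embeddings, gate_scores):
    """Fuse same-sized modality embeddings with softmax gate weights.

    embeddings:  dict mapping modality name -> list of floats.
    gate_scores: dict mapping modality name -> unnormalized gate score.
    Returns the fused vector and the gate weight assigned to each modality.
    """
    names = sorted(embeddings)
    dims = {len(embeddings[name]) for name in names}
    if len(dims) != 1:
        raise ValueError("all modality embeddings must share one dimension")
    weights = softmax([gate_scores[name] for name in names])
    fused = [0.0] * dims.pop()
    for weight, name in zip(weights, names):
        for i, value in enumerate(embeddings[name]):
            fused[i] += weight * value
    return fused, dict(zip(names, weights))


# Toy 2-dimensional embeddings for two modalities with equal gate scores.
fused, gates = gated_fusion(
    {"image": [1.0, 0.0], "radar": [0.0, 1.0]},
    {"image": 0.0, "radar": 0.0},
)
```

With equal scores the gate reduces to a plain average; raising one modality's score shifts the fused representation toward that modality.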
