Multimodal Transformers for Wireless Communications: A Case Study in Beam Prediction
Yu Tian, Qiyang Zhao, Zine el abidine Kherroubi, Fouzi Boukhalfa, Kebin Wu, Faouzi Bader
TL;DR
This work tackles beam management for high-frequency wireless systems by leveraging multimodal sensing (camera, LiDAR, radar, GPS) with a transformer-based fusion framework. It introduces data transformations (e.g., LiDAR BEV, radar range-angle/velocity maps) and processing steps (brightness enhancement, segmentation, background filtering) alongside training strategies like focal loss and EMA. The proposed multimodal transformer, combining CNN-based feature extraction with multi-time, multi-modality fusion, achieves state-of-the-art distance-based beam prediction scores on the DeepSense6G dataset, notably with image+GPS data yielding 78.44% overall accuracy and strong generalization to unseen day/night scenarios. The results demonstrate the viability of cross-domain feature fusion for sensing-aided beam management and point toward foundation-model-style pretraining for downstream radio network tasks.
Abstract
Wireless communications at high-frequency bands with large antenna arrays face challenges in beam management, which can potentially be improved by multimodal sensing information from cameras, LiDAR, radar, and GPS. In this paper, we present a multimodal transformer deep learning framework for sensing-assisted beam prediction. We employ a convolutional neural network to extract features from a sequence of images, point clouds, and raw radar data sampled over time. At each convolutional layer, we use transformer encoders to learn the hidden relations between feature tokens from different modalities and time instances over the abstraction space and to produce encoded vectors for the next level of feature extraction. We train the model on combinations of different modalities with supervised learning. To improve the model on imbalanced data, we use focal loss and an exponential moving average. We also evaluate data processing and augmentation techniques such as image enhancement, segmentation, background filtering, multimodal data flipping, radar signal transformation, and GPS angle calibration. Experimental results show that our solution trained on image and GPS data achieves the best distance-based accuracy of predicted beams at 78.44%, with effective generalization to unseen day scenarios (near 73%) and night scenarios (over 84%). This outperforms using other modalities and arbitrary data processing techniques, which demonstrates the effectiveness of transformers with feature fusion in performing radio beam prediction from images and GPS. Furthermore, our solution could be pretrained on large sequences of multimodal wireless data and then fine-tuned for multiple downstream radio network tasks.
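As a rough illustration of the two training aids named in the abstract, the sketch below shows the standard multi-class focal loss, FL(p_t) = -(1 - p_t)^gamma * log(p_t), and a generic exponential moving average update. It is a minimal sketch and not the authors' implementation: the function names, the value gamma = 2.0, the decay of 0.999, and the choice to apply the moving average to model weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of per-beam logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    p_t is the predicted probability of the true beam index `target`.
    gamma = 2.0 is an assumed value; gamma = 0 reduces to cross-entropy.
    """
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def ema_update(ema_weights, weights, decay=0.999):
    """One exponential moving average step over a flat list of weights.

    Applying the average to model weights is an assumption here; the
    abstract does not say which quantity is averaged.
    """
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


if __name__ == "__main__":
    # Hypothetical logits for four candidate beams.
    confident = [5.0, 0.0, 0.0, 0.0]   # true beam 0 is predicted with high probability
    uncertain = [0.5, 0.4, 0.3, 0.2]   # true beam 0 is barely ahead of the others
    print("easy sample loss:", focal_loss(confident, target=0))
    print("hard sample loss:", focal_loss(uncertain, target=0))
    print("ema step:", ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
```

The factor (1 - p_t)^gamma shrinks the loss of samples that are already classified well, so rarely seen beam indices contribute relatively more to the gradient, which is why focal loss is a common choice for imbalanced labels.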
