Building 6G Radio Foundation Models with Transformer Architectures

Ahmed Aboulfotouh, Ashkan Eshaghbeigi, Hatem Abou-Zeid

TL;DR

This work proposes a Vision Transformer (ViT) as a radio foundation model for spectrogram learning and demonstrates its effectiveness, using a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion.

Abstract

Foundation deep learning (DL) models are designed to learn general, robust, and adaptable representations of their target modality, enabling fine-tuning across a range of downstream tasks. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL). Foundation models have demonstrated better generalization than traditional supervised approaches, a critical requirement for wireless communications, where the dynamic environment demands model adaptability. In this work, we propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning. We introduce a Masked Spectrogram Modeling (MSM) approach to pretrain the ViT in a self-supervised fashion. We evaluate the ViT-based foundation model on two downstream tasks: Channel State Information (CSI)-based human activity sensing and spectrogram segmentation. Experimental results demonstrate performance competitive with supervised training while generalizing across diverse domains. Notably, on the spectrogram segmentation task, the pretrained ViT model outperforms a four-times larger model trained from scratch while requiring significantly less training time, and it achieves competitive performance on the CSI-based human activity sensing task. This work demonstrates that a ViT pretrained with MSM is a promising technique for scalable foundation model development in future 6G networks.
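
The masking step behind MSM can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 224x224 input size, the 16x16 patch size, the helper names (`patchify`, `random_mask`, `masked_mse`), and the choice to compute the reconstruction error only on masked patches (as in masked-autoencoder pretraining for images) are all assumptions made for illustration. Only the 75% masking ratio is taken from the paper's figure captions. The all-zeros "prediction" is a placeholder for the output of a ViT encoder and decoder, which are omitted here.

```python
import numpy as np

def patchify(spec, patch):
    """Split a 2-D spectrogram (H, W) into non-overlapping, flattened patches."""
    h, w = spec.shape
    assert h % patch == 0 and w % patch == 0, "dimensions must be multiples of patch"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh * gw, patch * patch)
    x = spec.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    mask = np.zeros(num_patches, dtype=bool)
    mask[perm[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error over masked patches only (an assumed loss choice)."""
    return float(np.mean((pred[mask] - target[mask]) ** 2))

rng = np.random.default_rng(0)
spec = rng.standard_normal((224, 224)).astype(np.float32)  # synthetic stand-in spectrogram
patches = patchify(spec, patch=16)                          # hypothetical patch size
mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)  # 75% ratio from the paper
visible = patches[~mask]                                    # what an encoder would see
pred = np.zeros_like(patches)                               # placeholder for model output
loss = masked_mse(pred, patches, mask)
```

With these assumed sizes the spectrogram yields 196 patches, of which 147 are masked and 49 remain visible, so an encoder operating only on visible patches would process a quarter of the input tokens during pretraining.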

Paper Structure

This paper contains 14 sections, 4 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: A sample from each class of the HSD dataset. Only the CSI of the first antenna is plotted.
  • Figure 2: A spectrogram and its segmentation from the SD dataset.
  • Figure 3: Proposed ViT Foundation Model for Radio Spectrograms.
  • Figure 4: Reconstruction results of ViT-M at various masking ratios pretrained with a $75\%$ masking ratio.
  • Figure 5: A spectrogram and its corresponding resource grid using a pooling filter of size 4.
  • ...and 3 more figures