Vision Transformer Based User Equipment Positioning
Parshwa Shah, Dhaval K. Patel, Brijesh Soni, Miguel López-Benítez, Siddhartan Govindasamy
TL;DR
This work tackles UE positioning by transforming the CSI matrix into an Angle Delay Profile (ADP) image and processing it with a Vision Transformer (ViT), whose patch-based attention can weight different regions of the ADP differently. The method achieves an RMSE of 0.55 m indoors and 13.59 m outdoors on DeepMIMO, and 3.45 m in the ViWi outdoor blockage scenario, outperforming state-of-the-art approaches by about 38%. It is validated on two ray-tracing datasets (DeepMIMO and ViWi) and needs no input beyond the CSI-derived ADP, including when only instantaneous CSI is available. It also yields a better error-distance distribution than the other approaches considered.
Abstract
Recently, Deep Learning (DL) techniques have been used for User Equipment (UE) positioning. However, the key shortcomings of such models are that: i) they give equal attention to the entire input; ii) they are not well suited to non-sequential data, e.g., when only instantaneous Channel State Information (CSI) is available. In this context, we propose an attention-based Vision Transformer (ViT) architecture that focuses on the Angle Delay Profile (ADP) derived from the CSI matrix. Our approach, validated on the `DeepMIMO' and `ViWi' ray-tracing datasets, achieves a Root Mean Squared Error (RMSE) of 0.55m indoors, 13.59m outdoors in DeepMIMO, and 3.45m in ViWi's outdoor blockage scenario. The proposed scheme outperforms state-of-the-art schemes by $\sim$38\%. It also performs substantially better than the other approaches that we have considered in terms of the distribution of error distance.
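
As a rough illustration of the CSI-to-ADP step and of the patch-based input a ViT consumes, the following minimal sketch assumes the commonly used two-dimensional DFT form of the ADP, $G = |V^H H F|$, where $V$ and $F$ are unitary DFT matrices over the antenna and subcarrier dimensions. The abstract does not spell out this definition, so the formula, the DFT sign convention, the 32x32 CSI size, the patch size of 8, and the helper names `dft_matrix`, `csi_to_adp`, and `to_patches` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def dft_matrix(n):
    """Unitary n x n DFT matrix (assumed convention: negative exponent)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n) / np.sqrt(n)

def csi_to_adp(H):
    """Map a complex CSI matrix H (antennas x subcarriers) to an ADP image.

    Assumed definition: G = |V^H H F|, i.e. a DFT across antennas (angle axis)
    and a DFT across subcarriers (delay axis), followed by the magnitude.
    """
    n_ant, n_sub = H.shape
    V = dft_matrix(n_ant)
    F = dft_matrix(n_sub)
    return np.abs(V.conj().T @ H @ F)

def to_patches(img, p):
    """Split a 2-D image into non-overlapping p x p patches, one flat row each."""
    h, w = img.shape
    if h % p or w % p:
        raise ValueError("image dimensions must be divisible by the patch size")
    blocks = img.reshape(h // p, p, w // p, p).transpose(0, 2, 1, 3)
    return blocks.reshape(-1, p * p)

# Hypothetical example: random CSI stands in for a ray-traced channel.
rng = np.random.default_rng(0)
H = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
adp = csi_to_adp(H)          # real, non-negative angle-delay image
tokens = to_patches(adp, 8)  # sequence of flattened patches for a ViT
```

In a ViT, each row of `tokens` would then be linearly projected, combined with a positional embedding, and passed through self-attention layers, so the model can weight patches of the ADP unequally instead of treating the whole input uniformly.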
