Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors
Saad Masrur, Jung-Fu Cheng, Atieh R. Khamesi, Ismail Guvenc
TL;DR
This work tackles accurate indoor localization in heavily non-line-of-sight (NLOS) environments under tight computational constraints by introducing Sensor Snapshot Tokenization (SST) and a lightweight transformer, L-SwiGLU-T. SST preserves the per-sensor semantics of the power delay profile (PDP), so the model can learn multivariate correlations with far fewer tokens than patch-based approaches, which reduces inference cost and data requirements. The L-SwiGLU-T architecture replaces layer normalization (LN) with RMSNorm, uses a SwiGLU-based feed-forward network (FFN), and drops the class token and positional embeddings in favor of global pooling. With SST, the 90th percentile error falls to $0.388$ m for Vanilla-T and $0.355$ m for L-SwiGLU-T, and the accuracy gains hold across simulated and real-world datasets. The approach generalizes across sensor counts and environments and outperforms larger CNN and transformer baselines while using far fewer FLOPs, which makes it practical for real-time, resource-constrained wireless localization in 5G/6G contexts.
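To make the architectural substitutions concrete, the following is a minimal sketch of the three components named above: RMSNorm, a SwiGLU feed-forward network, and global mean pooling in place of a class token. It uses plain Python lists and the standard SwiGLU formulation without biases; the function names, the `eps` value, and the toy weights are illustrative assumptions, not the authors' implementation.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide x by its root mean square (no mean subtraction), then scale."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (swish(w_gate @ x) * (w_up @ x))."""
    gate = [swish(z) for z in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def global_mean_pool(tokens):
    """Average all token embeddings; stands in for a dedicated class token."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

# Hypothetical toy values, chosen only to exercise the functions.
identity = [[1.0, 0.0], [0.0, 1.0]]
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
ffn_out = swiglu_ffn([1.0, 2.0], identity, identity, [[1.0, 1.0]])
pooled = global_mean_pool([[3.0, 4.0], [0.0, 2.0]])
```

Because pooling averages over tokens, the output is the same for any ordering of the sensor tokens, which is consistent with dropping positional embeddings.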
Abstract
Traditional approaches to indoor localization often yield poor accuracy in challenging non-line-of-sight (NLOS) environments. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, particularly the number of floating-point operations (FLOPs), which makes them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and applying Transformers directly to indoor localization can be computationally intensive and can limit accuracy. To address these challenges, we introduce a novel tokenization approach, Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP) and enhances the attention mechanism by effectively capturing multivariate correlations. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU-T), designed to reduce computational complexity without compromising localization accuracy. Together, these contributions reduce both the computational burden and the dependency on large datasets, making Transformer models more efficient and better suited to resource-constrained scenarios. Experimental results on simulated and real-world datasets demonstrate that SST and L-SwiGLU-T achieve substantial accuracy and efficiency gains, outperforming larger Transformer and CNN baselines by over 40% while using significantly fewer FLOPs and training samples.
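The tokenization idea can be illustrated with a small sketch that contrasts one-token-per-sensor tokenization with ViT-style patching of the same PDP matrix. The array sizes, the patch shape, and the function names are hypothetical and chosen only for illustration; the sketch assumes the matrix dimensions are divisible by the patch dimensions and is not the authors' code.

```python
def sst_tokens(pdp):
    """One token per sensor: each row (a sensor's full PDP) is kept intact."""
    return [list(row) for row in pdp]

def patch_tokens(pdp, ph, pw):
    """ViT-style patching: flatten each ph x pw block into one token.

    A patch taller than one row mixes values from different sensors in a token.
    Assumes the row and column counts are divisible by ph and pw.
    """
    rows, cols = len(pdp), len(pdp[0])
    tokens = []
    for r in range(0, rows, ph):
        for c in range(0, cols, pw):
            tokens.append(
                [pdp[i][j] for i in range(r, r + ph) for j in range(c, c + pw)]
            )
    return tokens

# Hypothetical toy input: 4 sensors, 8 delay taps each.
pdp = [[float(s * 8 + t) for t in range(8)] for s in range(4)]

sst = sst_tokens(pdp)             # 4 tokens, each of length 8
patches = patch_tokens(pdp, 2, 2) # 8 tokens, each of length 4
print(len(sst), len(patches))
```

In this toy case the per-sensor scheme produces half as many tokens as 2 x 2 patching. Since the cost of standard self-attention grows quadratically with the number of tokens, a shorter token sequence directly lowers the attention cost.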
