Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors

Saad Masrur, Jung-Fu Cheng, Atieh R. Khamesi, Ismail Guvenc

TL;DR

This work tackles accurate indoor localization in heavily non-line-of-sight (NLOS) environments under tight computational constraints by introducing Sensor Snapshot Tokenization (SST) and a lightweight Transformer, L-SwiGLU-T. SST preserves the per-sensor semantics of the power delay profile (PDP), enabling multivariate correlation learning with far fewer tokens than patch-based approaches, which reduces inference cost and data requirements. The L-SwiGLU-T architecture replaces layer normalization (LN) with RMSNorm, uses a SwiGLU-based feed-forward network (FFN), and drops the class token and positional embeddings in favor of global pooling, achieving substantial accuracy improvements across simulated and real-world datasets (e.g., 90th-percentile errors down to $0.388$ m for Vanilla-T and $0.355$ m for L-SwiGLU-T with SST). The approach generalizes well across sensor counts and environments and outperforms larger CNN and Transformer baselines while using far fewer FLOPs, which makes it practical for real-time, resource-constrained wireless localization in 5G/6G contexts.
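The three architectural changes named above (RMSNorm, a SwiGLU-based FFN, and global pooling in place of a class token) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the pre-norm residual layout, the use of mean pooling, the weight names (`w_gate`, `w_up`, `w_down`), and all sizes except the 18 sensor tokens are assumptions made for illustration.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide each token by its root-mean-square, then apply a gain.
    Unlike layer normalization, no mean is subtracted and no bias is added."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swish(x):
    """Swish (SiLU) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: a Swish-activated gate multiplies a linear branch."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

def global_average_pool(tokens):
    """Average over the token axis (assumed pooling type; replaces a class token)."""
    return tokens.mean(axis=0)

rng = np.random.default_rng(0)
n_tokens, d_emb, d_hidden = 18, 8, 16  # d_emb and d_hidden are illustrative only
tokens = rng.standard_normal((n_tokens, d_emb))
gain = np.ones(d_emb)
w_gate = rng.standard_normal((d_emb, d_hidden)) * 0.1
w_up = rng.standard_normal((d_emb, d_hidden)) * 0.1
w_down = rng.standard_normal((d_hidden, d_emb)) * 0.1

normed = rms_norm(tokens, gain)
out = tokens + swiglu_ffn(normed, w_gate, w_up, w_down)  # pre-norm residual (assumption)
pooled = global_average_pool(out)                        # one vector for the position head
print(pooled.shape)
```

Because pooling averages over tokens, the pooled vector does not depend on token order, which is consistent with omitting positional embeddings.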

Abstract

Traditional indoor localization approaches often achieve poor accuracy in challenging non-line-of-sight (NLOS) environments. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, particularly the number of floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers to indoor localization can be computationally intensive and limited in accuracy. To address these challenges, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of the power delay profile (PDP) and enhances attention mechanisms by effectively capturing multivariate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU-T) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and the dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. Experimental results on simulated and real-world datasets demonstrate that SST and L-SwiGLU-T achieve substantial accuracy and efficiency gains, outperforming larger Transformer and CNN baselines by over 40% while using significantly fewer FLOPs and training samples.
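To make the tokenization idea concrete, the sketch below contrasts SST with Time Snapshot Tokenization (TST), following the description in the Figure 3 caption: SST uses each sensor's PDP vector of size $1\times N_{\rm ts}$ as a token, while TST uses the samples of all sensors at one time instant. The value $N_{\rm S}=18$ comes from Figure 1; the number of time samples (256), the embedding width (32), and the random data are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def sst_tokens(pdp):
    """Sensor Snapshot Tokenization: one token per sensor, shape (N_S, N_ts)."""
    return pdp

def tst_tokens(pdp):
    """Time Snapshot Tokenization: one token per time instant, shape (N_ts, N_S)."""
    return pdp.T

def embed(tokens, weight, bias):
    """Linear projection of every token to the embedding dimension."""
    return tokens @ weight + bias

rng = np.random.default_rng(0)
n_sensors, n_time_samples, d_emb = 18, 256, 32  # 256 and 32 are hypothetical
pdp = rng.random((n_sensors, n_time_samples))   # stand-in for measured PDPs

sst = sst_tokens(pdp)
tst = tst_tokens(pdp)

weight = rng.standard_normal((n_time_samples, d_emb)) * 0.05
bias = np.zeros(d_emb)
sst_embedded = embed(sst, weight, bias)

# Self-attention forms one score per token pair, so its cost grows with
# the square of the token count.
print("SST tokens:", sst.shape[0], "attention scores:", sst.shape[0] ** 2)
print("TST tokens:", tst.shape[0], "attention scores:", tst.shape[0] ** 2)
```

Under these placeholder sizes, SST yields 18 tokens against 256 for TST, so the attention score matrix shrinks from 256 x 256 to 18 x 18 entries.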

Paper Structure (27 sections, 25 equations, 14 figures, 3 tables)


Figures (14)

  • Figure 1: 3GPP Indoor Factory (InF) layout with $N_{\rm S}=18$ sensor nodes. For the InF-DH scenario, the dimensions are $L=120$ m, $W=60$ m, and $D=20$ m, with 60% of the area covered by clutter of 6 m height and 2 m size.
  • Figure 2: Cumulative distribution function (CDF) of the received signal powers at different sensor nodes before and after power compression.
  • Figure 3: Example of PDP tokenization techniques for $N_{\rm S}=3$ sensors. The proposed Sensor Snapshot Tokenization (SST) treats each sensor's PDP vector of size $1\times N_{\rm ts}$ as a token, while the alternative Time Snapshot Tokenization (TST) uses vectors of time samples across all sensors at a single time instant as tokens.
  • Figure 4: Architecture of the vanilla-T model, an encoder-only Transformer (subfigure fig:ViT1), and of the proposed lightweight Swish-Gated Transformer, with the modified parts highlighted in light red (subfigure fig:ViT2).
  • Figure 5: Scaled dot-product attention mechanism with $\tilde{N}_{\rm tk} = 3$ tokens and embedding dimension $D_{\text{emb}} = 4$. The input $\tilde{\mathbf{Z}}$ is projected into $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ using weight matrices $\mathbf{W}_q$, $\mathbf{W}_k$, and $\mathbf{W}_v$, and attention is computed using the scaled dot product of $\mathbf{Q}$ and $\mathbf{K}^T$.
  • ...and 9 more figures
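
The scaled dot-product attention described in the Figure 5 caption can be sketched as follows, using the caption's sizes of $\tilde{N}_{\rm tk} = 3$ tokens and $D_{\text{emb}} = 4$. This is a minimal single-head illustration with random weights; setting the key dimension equal to $D_{\text{emb}}$ and the variable names are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(z, w_q, w_k, w_v):
    """Project tokens z to Q, K, V, then compute softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (n_tokens, n_tokens) pairwise scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_emb = 3, 4                   # sizes from the Figure 5 caption
z = rng.standard_normal((n_tokens, d_emb))
w_q = rng.standard_normal((d_emb, d_emb))
w_k = rng.standard_normal((d_emb, d_emb))
w_v = rng.standard_normal((d_emb, d_emb))

attended, weights = scaled_dot_product_attention(z, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```

Each row of `weights` describes how strongly one token attends to every other token; with SST, a row would therefore relate one sensor's PDP to those of all other sensors.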