Radiance-Field Reinforced Pretraining: Scaling Localization Models with Unlabeled Wireless Signals

Guosheng Wang, Shen Wang, Lei Yang

TL;DR

The paper tackles cross-scene generalization in RF-based indoor localization by introducing Radiance-Field Reinforced Pretraining (RFRP), a self-supervised framework that pretrains a large localization encoder (LocGPT+) together with a scene-specific RF-NeRF decoder on unlabeled RF data. LocGPT+ uses a Transformer with Mixture-of-Experts to learn scene-agnostic features, while RF-NeRF enforces physics-based spectral reconstruction through voxel radiosity and ray tracing. A masked autoencoder strategy and a joint training objective complement the approach, so the pretrained encoder can be fine-tuned with limited labeled data. On data spanning 100 scenes (75 for training, 25 held out for evaluation), the RFRP-pretrained model reduces localization error by over 40% relative to non-pretrained models and by 21% relative to supervised-pretrained ones, indicating that label-efficient pretraining generalizes across diverse environments.

Abstract

Radio frequency (RF)-based indoor localization offers significant promise for applications such as indoor navigation, augmented reality, and pervasive computing. While deep learning has greatly enhanced localization accuracy and robustness, existing localization models still face major challenges in cross-scene generalization due to their reliance on scene-specific labeled data. To address this, we introduce Radiance-Field Reinforced Pretraining (RFRP). This novel self-supervised pretraining framework couples a large localization model (LM) with a neural radio-frequency radiance field (RF-NeRF) in an asymmetrical autoencoder architecture. In this design, the LM encodes received RF spectra into latent, position-relevant representations, while the RF-NeRF decodes them to reconstruct the original spectra. This alignment between input and output enables effective representation learning using large-scale, unlabeled RF data, which can be collected continuously with minimal effort. To this end, we collected RF samples at 7,327,321 positions across 100 diverse scenes using four common wireless technologies--RFID, BLE, WiFi, and IIoT. Data from 75 scenes were used for training, and the remaining 25 for evaluation. Experimental results show that the RFRP-pretrained LM reduces localization error by over 40% compared to non-pretrained models and by 21% compared to those pretrained using supervised learning.

Paper Structure

This paper contains 24 sections, 27 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Radiance Field Reinforced Learning for Pretraining Large Localization Models. This approach integrates an LM and an RF-NeRF model into a unified encoder-decoder training framework, designed to pretrain the LM to extract generalizable features pertinent to localization.
  • Figure 2: Illustration of Transformer Block. It consists of six key components: tokenization, positional encoding, self-attention, multi-head attention, layer normalization, and FFN.
  • Figure 3: Illustration of Mixture-of-Experts. The MoE layer replaces a single FFN with $N$ expert networks $\{\text{FFN}_1,\dots,\text{FFN}_N\}$. The output combines contributions from $N_s$ shared experts that process all tokens and $K$ optional experts selected from the remaining $N-N_s$ via a gating network.
  • Figure 4: Architecture of NeRF$^2$. The network consists of two MLPs: the attenuation network and the radiance network. The attenuation network predicts the attenuation $\delta$ of any voxel. Given the TX position and a measuring direction, the radiance network predicts the signal transmitted from an arbitrary voxel.
  • Figure 5: Illustration of ray tracing. There are four voxels at $P_1$ to $P_4$ on the ray. Each voxel becomes a new transmitter that emits the signal along the ray to the RX. Their signals are attenuated by the other voxels between the new transmitters and the RX.
  • ...and 9 more figures
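
The mechanisms described in the captions of Figures 3 and 5 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `moe_layer` and `trace_ray`, the linear softmax gate, the use of unnormalized top-K softmax scores as mixing weights, and the modeling of each voxel's attenuation as a single complex multiplicative factor are all assumptions made for illustration. The paper's exact equations are not reproduced on this page.

```python
import numpy as np


def moe_layer(x, experts, n_shared, gate_w, k):
    """Sketch of an MoE layer with shared and routed experts (cf. Figure 3).

    x        : (d,) token embedding.
    experts  : list of N callables; the first n_shared are shared experts.
    gate_w   : (N - n_shared, d) weights of a hypothetical linear gate.
    k        : number of routed (optional) experts selected per token.
    Returns the layer output and the indices of the selected routed experts.
    """
    # Shared experts process every token unconditionally.
    shared_out = sum(f(x) for f in experts[:n_shared])

    # Gating network: softmax over the remaining N - n_shared experts.
    routed = experts[n_shared:]
    logits = gate_w @ x
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()

    # Keep the K highest-scoring routed experts (assumed weighting scheme).
    top = np.argsort(scores)[-k:]
    routed_out = sum(scores[i] * routed[i](x) for i in top)
    return shared_out + routed_out, sorted(int(i) for i in top)


def trace_ray(emitted, attenuation):
    """Sketch of signal accumulation along one ray (cf. Figure 5).

    emitted[i]     : complex signal re-emitted toward the RX by voxel i,
                     with voxels ordered from nearest the RX to farthest.
    attenuation[i] : assumed complex multiplicative factor applied by
                     voxel i to any signal passing through it.
    """
    emitted = np.asarray(emitted, dtype=complex)
    attenuation = np.asarray(attenuation, dtype=complex)
    # path_gain[i] is the product of the factors of all voxels lying
    # between voxel i and the RX; the nearest voxel is unattenuated.
    path_gain = np.concatenate(([1.0 + 0j], np.cumprod(attenuation[:-1])))
    return np.sum(emitted * path_gain)


if __name__ == "__main__":
    x = np.array([1.0, 0.0])
    # One shared expert (scale 1) and three routed experts (scales 2, 3, 4).
    experts = [lambda v, c=c: c * v for c in (1.0, 2.0, 3.0, 4.0)]
    gate_w = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
    out, chosen = moe_layer(x, experts, n_shared=1, gate_w=gate_w, k=1)
    print("selected routed experts:", chosen)

    # Four voxels P1..P4, each emitting 1 and halving what passes through.
    print("received signal:", trace_ray([1, 1, 1, 1], [0.5, 0.5, 0.5, 0.5]))
```

In the ray example, the voxel nearest the RX contributes at full strength and each farther voxel is scaled by the factors of the voxels in front of it, so obstacles close to the receiver affect the contribution of everything behind them.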