
A Wireless World Model for AI-Native 6G Networks

Ziqi Chen, Yi Ren, Yixuan Huang, Qi Sun, Nan Li, Yuhong Huang, Chih-Lin I, Yifan Li, Liang Xia

Abstract

Integrating AI into the physical layer is a cornerstone of 6G networks. However, current data-driven approaches struggle to generalize across dynamic environments because they lack an intrinsic understanding of electromagnetic wave propagation. We introduce the Wireless World Model (WWM), a multi-modal foundation framework predicting the spatiotemporal evolution of wireless channels by internalizing the causal relationship between 3D geometry and signal dynamics. Pre-trained on a massive ray-traced multi-modal dataset, WWM overcomes the data authenticity gap, further validated under real-world measurement data. Using a joint-embedding predictive architecture with a multi-modal mixture-of-experts Transformer, WWM fuses channel state information, 3D point clouds, and user trajectories into a unified representation. Across the five key downstream tasks supported by WWM, it achieves remarkable performance in seen environments, unseen generalization scenarios, and real-world measurements, consistently outperforming SOTA uni-modal foundation models and task-specific models. This paves the way for physics-aware 6G intelligence that adapts to the physical world.

Paper Structure

This paper contains 25 sections, 19 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: The workflow of WWM. a, Multi-modal data sources for pre-training and evaluation. Ray tracing simulation is performed with Sionna RT and yields Channel State Information (CSI), 3D point clouds, and User Equipment (UE) trajectories. Field measurements are collected outdoors using a China Mobile 6G prototype system. b, Pre-training model architecture and pre-training tasks. WWM is pre-trained with a Joint Embedding Predictive Architecture (JEPA) comprising an encoder and a predictor. Both the encoder and predictor are multi-modal mixture-of-experts Transformers, trained on three self-supervised masking tasks. c, Downstream tasks for validation based on WWM embeddings. The representational capabilities of WWM are verified on four downstream tasks with simulated data, and its real-world generalization ability is evaluated on CSI frequency-domain prediction based on field measurements.
  • Figure 2: Large-scale multi-modal wireless dataset spanning diverse simulated and real-world environments. Across these environments, we collect multi-modal data including scenario 3D point clouds, user trajectories and time-synchronized CSI for each sample. a, Representative real-world photographs of the selected urban environments used for ray tracing simulation. From top to bottom: Place de l’Étoile (Paris), Forbidden City (Beijing), Munich urban district (Germany), central business district (Beijing), and Wall Street (New York). b, Corresponding 3D scenario models constructed from geographic data and ground signal coverage maps generated in Sionna RT. c, Extracted 3D point clouds of corresponding 3D scenario models. d, Photograph of the real-world outdoor measurement environment used for channel data acquisition, with the base station (BS) location indicated. e, Satellite view of the measurement site, where the yellow cross marks the BS position and the green trapezoid indicates the UE trajectory. f, 3D point clouds reconstructed from the measurement environment. g, BS hardware of the 6G prototype system used for real-world measurements. h, UE device used for outdoor channel data acquisition.
  • Figure 3: The model and pre-training process. a, WWM employs a Joint Embedding Predictive Architecture (JEPA) to infer masked multi-modal features in latent space. An online encoder processes visible tokens while a predictor estimates masked embeddings, supervised by an Exponential Moving Average (EMA) based momentum encoder to ensure representation stability. b, Multi-modal Mixture of Experts (MMoE). Heterogeneous inputs (CSI, 3D point clouds, and trajectories) are tokenized via domain-specific embedders (Conv3D, PointNet, and MLP). Within each Transformer block, shared self-attention performs global cross-modal reasoning, followed by modality-specific experts (H-FFN, PC-FFN, P-FFN) that preserve physical inductive biases. c, Pre-training masking strategies. Three complementary strategies supervise the model: fine-grained and coarse CSI masking to extract multi-scale spatio-temporal propagation features, and trajectory masking to capture kinematic dynamics and their interaction with the electromagnetic environment.
  • Figure 4: Pre-training results. a, Visualizations of CSI reconstruction. 1st graph: original 16-timestep CSI sample. 2nd graph: masked CSI sample used as input to WWM under the fine-grained masking strategy. 3rd graph: CSI sample reconstructed from the masked input. b, t-SNE maps of the encoder's final-layer embeddings across five sampling schemes: randomly sampled token-level embeddings, samples grouped by city, samples grouped by LOS/NLOS conditions, samples grouped by BS, and samples grouped by noise level.
  • Figure 5: Downstream tasks and results comparison. a, CSI temporal prediction task details. b, CSI temporal prediction performance compared with the SOTA WFM Wifo and the SOTA task-specific model LSTM. c, CSI compression and feedback task details. d, CSI compression and feedback performance compared with the SOTA task-specific models QCR-NET and CR-NET across various compression rates. e, Beam prediction task details. f, Beam prediction across frequency bands compared with the SOTA WFM LWM and the SOTA task-specific model SPMM. g, User localization task details. h, User localization performance compared with the SOTA task-specific model deep-CNN. i, Details of the CSI frequency-domain prediction task on real-world measurements. j, CSI frequency-domain prediction performance compared with the SOTA WFM Wifo and the SOTA task-specific model C-Mixer.
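The MMoE Transformer block described in Figure 3b (shared self-attention followed by modality-specific feed-forward experts) can be sketched as below. This is an illustrative PyTorch sketch, not the authors' implementation: the hidden size, head count, and expert routing by a modality-ID tensor are assumptions made for this example; only the structure (one shared attention, one FFN expert each for CSI, point-cloud, and trajectory tokens) follows the caption.

```python
# Illustrative sketch of a multi-modal mixture-of-experts (MMoE) Transformer
# block: shared self-attention over all tokens, then routing each token to a
# modality-specific FFN expert (H-FFN for CSI, PC-FFN for point clouds,
# P-FFN for trajectories). Dimensions and routing details are assumptions.
import torch
import torch.nn as nn

class MMoEBlock(nn.Module):
    def __init__(self, dim=256, heads=8, ffn_mult=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality: 0 = CSI, 1 = point cloud, 2 = trajectory
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                          nn.Linear(ffn_mult * dim, dim))
            for _ in range(3)
        ])

    def forward(self, x, modality_ids):
        # x: (batch, tokens, dim); modality_ids: (batch, tokens) in {0, 1, 2}
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # shared cross-modal attention
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality_ids == m      # route tokens to their expert
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out

# Example: per sample, 4 CSI tokens, 3 point-cloud tokens, 2 trajectory tokens
block = MMoEBlock(dim=64, heads=4)
tokens = torch.randn(2, 9, 64)
ids = torch.tensor([[0] * 4 + [1] * 3 + [2] * 2] * 2)
y = block(tokens, ids)
print(y.shape)  # torch.Size([2, 9, 64])
```

Routing by a fixed modality ID (rather than a learned gate) matches the caption's description of dedicated per-modality experts: every token always visits the expert of its own modality, while the shared attention layer is where cross-modal information mixes.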