
Taming Vision Priors for Data Efficient mmWave Channel Modeling

Zhenlin An, Longfei Shangguan, John Kaewell, Philip Pietraski, Jelena Senic, Camillo Gentile, Nada Golmie, Kyle Jamieson

Abstract

Accurately modeling millimeter-wave (mmWave) propagation is essential for real-time AR and autonomous systems. Differentiable ray tracing offers a physics-grounded solution but still faces deployment challenges because it relies on either exhaustive channel measurements or brittle, hand-tuned scene models for material properties. We present VisRFTwin, a scalable and data-efficient digital-twin framework that integrates vision-derived material priors with differentiable ray tracing. Multi-view images from commodity cameras are processed by a frozen Vision-Language Model to extract dense semantic embeddings, which are translated into initial estimates of permittivity and conductivity for scene surfaces. These priors initialize a Sionna-based differentiable ray tracer, which rapidly calibrates material parameters via gradient descent using only a few dozen sparse channel soundings. Once calibrated, the association between vision features and material parameters is retained, enabling fast transfer to new scenarios without repeated calibration. Evaluations across three real-world scenarios (office interiors, urban canyons, and dynamic public spaces) show that VisRFTwin reduces channel measurement needs by up to 10$\times$ while achieving a 59% lower median delay spread error than pure data-driven deep learning methods.

Paper Structure (33 sections, 12 equations, 26 figures, 3 tables)

Figures (26)

  • Figure 1: System Workflow. VisRFTwin leverages 3D scene reconstruction (NeRF) and pretrained vision–language models to extract semantic material features from images. These vision priors are translated into electromagnetic parameters to bootstrap differentiable ray tracing, reducing the amount of channel measurements needed for channel modeling.
  • Figure 2: Differentiable ray tracing framework. (a) Real-world ray tracing visualization with rays, transmitters, receivers, and interaction points. (b) End-to-end differentiable training loop using channel mismatch as gradient feedback.
  • Figure 3: Comparison of material understanding methods in indoor environments. (a) Input image of a complex indoor scene. (b) Classic material segmentation models struggle with generalization, producing coarse and inaccurate segmentation that fails to adapt to the diversity of real-world materials. (c) The Segment Anything Model (SAM) [cen_segment_2024] offers high-quality instance boundaries but lacks the ability to differentiate materials, limiting its applicability for material-aware tasks. (d, e) Our vision-guided model extracts rich semantic features using CLIP and simultaneously predicts accurate depth maps, enabling projection onto 3D surfaces.
  • Figure 4: Feature Re-projection.
  • Figure 5: 3D Semantic Field. The colors encode semantic features, with different hues indicating distinct material categories across the scene.
  • ...and 21 more figures
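
The calibration loop summarized in the abstract and in Figure 2(b), where a material parameter starts from a vision-derived prior and is refined by gradient descent on the channel mismatch, can be illustrated with a toy sketch. This is not the VisRFTwin or Sionna implementation. It assumes a single lossless dielectric surface, uses the normal-incidence Fresnel reflection magnitude as a stand-in for the full ray tracer, and replaces automatic differentiation with a hand-derived gradient. The function names, the prior value of 4.0, the "true" permittivity of 5.0, the learning rate, and the step count are all hypothetical choices made for illustration.

```python
import math


def reflection_magnitude(eps_r):
    """|Gamma| at normal incidence for a lossless dielectric half-space (eps_r >= 1)."""
    n = math.sqrt(eps_r)  # refractive index of the surface material
    return (n - 1.0) / (n + 1.0)


def d_reflection_d_eps(eps_r):
    """Analytic derivative of reflection_magnitude with respect to eps_r."""
    n = math.sqrt(eps_r)
    # d|Gamma|/dn = 2 / (n + 1)^2 and dn/d(eps_r) = 1 / (2 n)
    return 1.0 / (n * (n + 1.0) ** 2)


def calibrate(eps_prior, measured, lr=100.0, steps=200):
    """Gradient descent on the mean squared mismatch, starting from a prior.

    eps_prior: initial relative permittivity (hypothetical vision-derived guess)
    measured:  list of measured reflection magnitudes (toy channel soundings)
    """
    eps = eps_prior
    for _ in range(steps):
        # Mean residual between the toy forward model and the measurements
        residual = sum(reflection_magnitude(eps) - m for m in measured) / len(measured)
        # Gradient of mean((model - m_i)^2) with respect to eps
        grad = 2.0 * residual * d_reflection_d_eps(eps)
        # Keep the estimate physically meaningful (eps_r >= 1)
        eps = max(1.0, eps - lr * grad)
    return eps


# Toy usage: three noise-free soundings of a surface with eps_r = 5.0 (assumed)
EPS_TRUE = 5.0
soundings = [reflection_magnitude(EPS_TRUE)] * 3
eps_hat = calibrate(eps_prior=4.0, measured=soundings)
print(f"calibrated eps_r = {eps_hat:.4f}")
```

In this sketch the prior only sets the starting point of the optimization. In a real differentiable ray tracer the forward model would be the simulated channel response, conductivity would be calibrated alongside permittivity, and the gradient would come from automatic differentiation rather than a closed-form derivative.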