Simulation-Enhanced Data Augmentation for Machine Learning Pathloss Prediction

Ahmed P. Mohamed, Byunghyun Lee, Yaguang Zhang, Max Hollingsworth, C. Robert Anderson, James V. Krogmeier, David J. Love

TL;DR

The paper tackles data scarcity in ML-based pathloss prediction by introducing a simulation-enhanced data augmentation pipeline that fuses real measurements with synthetic data from a high-resolution LiDAR-enabled cellular coverage simulator. It constructs site-specific features from LiDAR-derived geographic data and trains a CatBoost model to predict pathloss. Key findings show improved generalization to unseen environments, with MAE reductions up to ~12 dB in some scenarios and robust performance even with limited real data when synthetic data is balanced with measurements. The approach reduces the need for extensive field campaigns, enabling more reliable and scalable network planning across rural, residential, and hilly terrains.

Abstract

Machine learning (ML) offers a promising solution to pathloss prediction. However, its effectiveness can be degraded by the limited availability of data. To alleviate this challenge, this paper introduces a novel simulation-enhanced data augmentation method for ML pathloss prediction. Our method integrates synthetic data generated from a cellular coverage simulator and independently collected real-world datasets. These datasets were collected through an extensive measurement campaign in different environments, including farms, hilly terrains, and residential areas. This comprehensive data collection provides vital ground truth for model training. A set of channel features was engineered, including geographical attributes derived from LiDAR datasets. These features were then used to train our prediction model, incorporating the highly efficient and robust gradient boosting ML algorithm, CatBoost. The integration of synthetic data, as demonstrated in our study, significantly improves the generalizability of the model in different environments, achieving a remarkable improvement of approximately 12 dB in terms of mean absolute error for the best-case scenario. Moreover, our analysis reveals that even a small fraction of measurements added to the simulation training set, with proper data balance, can significantly enhance the model's performance.
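
The abstract's last point, adding a small fraction of measurements to the simulation training set "with proper data balance", can be illustrated with a minimal sketch. It assumes that balance is achieved by repeating the sampled real rows, which is one plausible reading of the Figure 5 caption and not a confirmed description of the authors' procedure. The function names, the toy rows, and all numeric values below are hypothetical, and the model-fitting step (CatBoost in the paper) is left out so the sketch runs with the standard library alone.

```python
import random


def mean_absolute_error(y_true, y_pred):
    """MAE in the unit of the inputs (dB for pathloss)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def build_training_set(synthetic_rows, real_rows, real_fraction=0.05,
                       repeats=1, seed=0):
    """Mix all synthetic rows with a repeated random subset of real rows.

    Assumption: repetition is used to keep the few real rows from being
    swamped by the much larger synthetic set.
    """
    rng = random.Random(seed)
    n_real = max(1, round(real_fraction * len(real_rows)))
    subset = rng.sample(real_rows, n_real)
    return list(synthetic_rows) + subset * repeats


# Toy rows of (distance_m, pathloss_dB); the values are invented.
synthetic = [(d, 80.0 + 0.02 * d) for d in range(100, 1100, 10)]  # 100 rows
real = [(d, 85.0 + 0.02 * d) for d in range(100, 2100, 50)]       # 40 rows

# 5% of 40 real rows is 2 rows; repeated 10 times gives 20 real rows.
train = build_training_set(synthetic, real, real_fraction=0.05, repeats=10)
print(len(train))  # 120

# MAE of two toy predictions: (|100 - 104| + |110 - 108|) / 2
print(mean_absolute_error([100.0, 110.0], [104.0, 108.0]))  # 3.0
```

In practice the mixed set would be passed to a regressor and the MAE evaluated on held-out measurements from an unseen environment; sweeping `repeats` would then trace a curve of the kind the Figure 5 caption describes.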

Paper Structure

This paper contains 15 sections, 3 equations, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Flowchart of the proposed simulation-enhanced data augmentation process.
  • Figure 2: in derived from the data collected during the measurement campaign.
  • Figure 3: Illustration of the engineered features used as inputs for the algorithm.
  • Figure 4: Prediction performance in the same environment, where training data consists of synthetic data structured to mimic the characteristics of the real test data environment.
  • Figure 5: MAE vs 5% Real Data Repetitions in Training Set.