A Dataset Similarity Evaluation Framework for Wireless Communications and Sensing

Joao Morais, Sadjad Alikhani, Akshay Malhotra, Shahab Hamidi-Rad, Ahmed Alkhateeb

TL;DR

The paper addresses the problem of predicting how a model trained on one wireless deployment will perform on another, and introduces a task-driven, model-agnostic framework for quantifying dataset similarity. It computes dataset distances in UMAP-based latent spaces using Euclidean and Wasserstein metrics, then correlates those distances with cross-dataset performance on a channel state information (CSI) compression task. Empirical results show that latent-space distances outperform traditional raw-space metrics: autoencoder (AE) latent representations achieve the strongest correlations, while UMAP offers a practical, efficient alternative with correlations above 0.85. The framework supports data selection, augmentation, and retraining decisions, and is demonstrated on CSI compression using a realistic DeepMIMO ASU campus dataset.

Abstract

This paper introduces a task-specific, model-agnostic framework for evaluating dataset similarity, providing a means to assess and compare dataset realism and quality. Such a framework is crucial for augmenting real-world data, improving benchmarking, and making informed retraining decisions when adapting to new deployment settings, such as different sites or frequency bands. The framework is used to design metrics based on UMAP, a topology-preserving dimensionality reduction technique, that apply Wasserstein and Euclidean distances to KNN clusters in the latent space. The designed metrics show correlations above 0.85 between dataset distance and model performance on an unsupervised channel state information (CSI) compression task that uses autoencoder architectures. The results show that the designed metrics outperform traditional methods.
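The evaluation idea described in the abstract (measure a distance between two datasets in a latent space, then check how well that distance tracks task performance) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the latent points have already been produced by some reducer such as UMAP, it skips the KNN clustering step, and it replaces the multi-dimensional Wasserstein distance with a per-axis average of 1-D distances. The function names and all toy values are hypothetical.

```python
import math
from statistics import fmean


def centroid_distance(a, b):
    """Euclidean distance between the mean points of two latent point sets."""
    dim = len(a[0])
    centroid_a = [fmean(p[d] for p in a) for d in range(dim)]
    centroid_b = [fmean(p[d] for p in b) for d in range(dim)]
    return math.dist(centroid_a, centroid_b)


def wasserstein_1d(x, y):
    """1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size empirical samples this is the mean absolute
    difference between the sorted values.
    """
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    return fmean(abs(u - v) for u, v in zip(sorted(x), sorted(y)))


def per_axis_wasserstein(a, b):
    """Average 1-D Wasserstein distance over latent axes (a simplification,
    not a true multi-dimensional Wasserstein distance)."""
    dim = len(a[0])
    return fmean(
        wasserstein_1d([p[d] for p in a], [p[d] for p in b]) for d in range(dim)
    )


def pearson(xs, ys):
    """Pearson correlation between dataset distances and a performance metric."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy latent points (made up): dataset B is dataset A shifted by (3, 4).
latent_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latent_b = [(3.0, 4.0), (4.0, 4.0), (3.0, 5.0), (4.0, 5.0)]

print(centroid_distance(latent_a, latent_b))     # 5.0
print(per_axis_wasserstein(latent_a, latent_b))  # 3.5

# Made-up distances and cross-dataset losses for three dataset pairs.
toy_distances = [1.0, 2.0, 3.0]
toy_losses = [2.0, 4.0, 6.0]
print(pearson(toy_distances, toy_losses))
```

In this framing, a distance function is judged useful for a task when the correlation between its outputs and the observed cross-dataset performance is high, which is the quantity the abstract reports as exceeding 0.85 for the UMAP-based metrics.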

Paper Structure

This paper contains 13 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Applications enabled by dataset distance computation.
  • Figure 2: Framework for assessing how well a distance function suits a given task, model, and set of datasets, measured by how strongly the distances it outputs correlate with performance on that task.
  • Figure 3: Real (left) and rendered (right) top views of the ASU campus from the DeepMIMO dataset. The rendered view shows received power distribution using a standard DFT codebook at the base station, highlighting key effects like roof diffraction.
  • Figure 4: Architecture of the autoencoder model used for the unsupervised CSI compression task, inspired by CSINet+ (Guo et al., 2019).
  • Figure 5: Visualization of latent spaces generated by different dimensionality reduction techniques. From left to right: the original space with proximity-based clustering, followed by PCA, t-SNE, UMAP, and a combination of PCA and UMAP.