Sliced-Wasserstein Distance-based Data Selection
Julien Pallage, Antoine Lesage-Landry
TL;DR
This work proposes a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning. It presents the method's filtering patterns on synthetic datasets and numerically benchmarks the method for training data selection.
Abstract
We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning. Our filtering technique is well suited to decision-making pipelines that deploy machine learning models in critical sectors, e.g., power systems, as it offers conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first processes reduced-cardinality representations of the datasets concurrently. The second uses a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
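For readers unfamiliar with the sliced-Wasserstein distance underlying the method, the following is a minimal sketch of a standard Monte Carlo estimator between two equal-size empirical samples. It is not the authors' filtering procedure or either of their approximations; the function name `sliced_wasserstein` and the parameters `n_projections`, `p`, and `seed` are assumptions chosen for illustration.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=200, p=2, seed=0):
    """Monte Carlo estimate of the order-p sliced-Wasserstein distance
    between two equal-size empirical samples x and y of shape (n, d).

    Illustrative sketch only; names and defaults are hypothetical.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal-size samples")
    rng = np.random.default_rng(seed)
    # Draw random directions uniformly on the unit sphere.
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto each direction and sort each 1-D projection.
    x_proj = np.sort(x @ theta.T, axis=0)
    y_proj = np.sort(y @ theta.T, axis=0)
    # In 1-D, optimal transport between equal-size samples pairs sorted values.
    return float(np.mean(np.abs(x_proj - y_proj) ** p) ** (1.0 / p))

# Usage: a sample compared with a shifted copy of itself.
rng = np.random.default_rng(1)
sample = rng.normal(size=(500, 2))
shifted = sample + np.array([3.0, 0.0])
d_same = sliced_wasserstein(sample, sample)
d_shift = sliced_wasserstein(sample, shifted)
```

Each projection reduces the problem to one dimension, where the Wasserstein distance between equal-size samples only requires sorting; this is the property that makes slicing cheap compared with solving a full optimal transport problem.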
