Sliced-Wasserstein Distance-based Data Selection

Julien Pallage, Antoine Lesage-Landry

TL;DR

This work proposes a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches, presents the method's filtering patterns on synthetic datasets, and numerically benchmarks it for training data selection.

Abstract

We propose a new unsupervised anomaly detection method based on the sliced-Wasserstein distance for training data selection in machine learning approaches. Our filtering technique is well suited to decision-making pipelines that deploy machine learning models in critical sectors, e.g., power systems, as it offers a conservative data selection and an optimal transport interpretation. To ensure the scalability of our method, we provide two efficient approximations. The first approximation processes reduced-cardinality representations of the datasets concurrently. The second makes use of a computationally light Euclidean distance approximation. Additionally, we release the first open dataset showcasing localized critical peak rebate demand response in a northern climate. We present the filtering patterns of our method on synthetic datasets and numerically benchmark our method for training data selection. Finally, we employ our method as part of a first forecasting benchmark for our open-source dataset.
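To make the central quantity concrete, the following is a minimal sketch of a Monte Carlo estimate of the sliced-Wasserstein distance between two empirical samples. It is not the authors' filter or either of their approximations: the function names `random_direction` and `sliced_wasserstein`, the number of projections, and the restriction to equal-size samples are assumptions made for illustration. Each random direction reduces the problem to one dimension, where the optimal transport plan between equal-size samples simply matches sorted projections.

```python
import math
import random


def random_direction(d, rng):
    """Draw a direction uniformly on the unit sphere of R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def sliced_wasserstein(xs, ys, t=2, n_projections=200, seed=0):
    """Monte Carlo estimate of the order-t sliced-Wasserstein distance
    between two equal-size samples xs and ys (lists of d-dimensional points).
    Hypothetical helper: names and defaults are illustrative only."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    d = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_direction(d, rng)
        # Project both samples onto theta and sort: in 1D, the optimal
        # coupling of equal-size samples pairs the sorted values.
        px = sorted(sum(a * b for a, b in zip(p, theta)) for p in xs)
        py = sorted(sum(a * b for a, b in zip(p, theta)) for p in ys)
        total += sum(abs(a - b) ** t for a, b in zip(px, py)) / len(px)
    # Average the 1D costs over directions, then take the t-th root.
    return (total / n_projections) ** (1.0 / t)


if __name__ == "__main__":
    rng = random.Random(1)
    clean = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(100)]
    shifted = [[p[0] + 3.0, p[1]] for p in clean]
    print(sliced_wasserstein(clean, clean))    # identical samples
    print(sliced_wasserstein(clean, shifted))  # samples shifted along one axis
```

A sample compared with itself yields a distance of zero, while the shifted copy yields a strictly positive value; a distance-based filter can threshold such a quantity, although the exact decision rule used in the paper is not reproduced here.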

Paper Structure

This paper contains 16 sections, 4 theorems, 14 equations, 12 figures, 4 tables.

Key Result

Theorem 1

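There exists a constant $0<C({d, t})<+\infty$ such that, for all $\mathbb{U}, \mathbb{V} \in \mathcal{P}(\mathcal{B}(0, R))$, where $\mathcal{B}(0, R)$ is the closed ball of radius $R>0$ in $\mathbb{R}^d$ centred at the origin:

$${SW}_{\|\cdot\|_t,t}(\mathbb{U}, \mathbb{V})^t \leq c({d, t})\, {W}_{\|\cdot\|_t,t}(\mathbb{U}, \mathbb{V})^t \leq C({d, t})\, R^{t-1/(d+1)}\, {SW}_{\|\cdot\|_t,t}(\mathbb{U}, \mathbb{V})^{1/(d+1)},$$

where $c({d, t})=\frac{1}{d} \int_{\mathcal{S}^{d-1}_1}\|\theta\|_t^t \mathrm{~d} \theta \leq 1$.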
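The constant $c(d, t)$ can be checked numerically. The sketch below estimates it by Monte Carlo, assuming that $\mathrm{d}\theta$ denotes the normalized uniform measure on the unit sphere (an assumption, since the normalization is not spelled out here); the function name `estimate_c` is hypothetical. For $t=2$, every unit vector satisfies $\|\theta\|_2^2 = 1$, so under this assumption the estimate should equal $1/d$.

```python
import math
import random


def estimate_c(d, t, n_samples=20000, seed=0):
    """Monte Carlo estimate of c(d, t) = (1/d) * E[ ||theta||_t^t ],
    with theta uniform on the unit sphere of R^d.
    Assumes the integral is taken against the normalized uniform measure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # A normalized Gaussian vector is uniform on the sphere.
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm2 = math.sqrt(sum(x * x for x in g))
        total += sum(abs(x / norm2) ** t for x in g)
    return total / (n_samples * d)


if __name__ == "__main__":
    for t in (1, 2, 3):
        print(t, estimate_c(3, t))
```

Under the stated assumption, the printed values stay at or below 1 for each order tested, in line with the bound $c(d, t) \leq 1$.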

Figures (12)

  • Figure 1: Stylized data pipeline
  • Figure 2: Empirical illustration of Proposition 1 for $\|\cdot\|_2$
  • Figure 3: Labelling of the SW filter for different values of $\epsilon$
  • Figure 4: Results of the data selection experiment
  • Figure 5: Distribution of key features for each substation
  • ...and 7 more figures

Theorems & Definitions (4)

  • Theorem 1: Equivalence of ${SW}_{\|\cdot\|_t,t}$ and ${W}_{\|\cdot\|_t,t}$ (Bonnotte, 2013)
  • Lemma 1
  • Lemma 2: Remark 6.6 (Villani, 2008)
  • Proposition 1