Novelty detection on path space

Ioannis Gasteratos, Antoine Jacquier, Maud Lemercier, Terry Lyons, Cristopher Salvi

TL;DR

This paper frames novelty detection for trajectories as a hypothesis test on path space with signature-based test statistics. Using transportation-cost inequalities, it derives tail bounds for false positive rates that extend beyond Gaussian measures to laws of RDE solutions with smooth bounded vector fields. Exploiting the shuffle product, it gives exact formulae for smooth CVaR surrogates in terms of expected signatures, which lead to one-class SVM (OC-SVM) algorithms that optimise smooth CVaR objectives. It also establishes lower bounds on type-II error for alternatives with finite first moment, giving general power bounds when the reference measure and the alternative are mutually absolutely continuous. The approach is evaluated on synthetic anomalous diffusion data and on real RNA nanopore sequencing data. The result is a set of testing tools for path-valued data that do not rely on Gaussian assumptions.

Abstract

We frame novelty detection on path space as a hypothesis testing problem with signature-based test statistics. Using transportation-cost inequalities of Gasteratos and Jacquier (2023), we obtain tail bounds for false positive rates that extend beyond Gaussian measures to laws of RDE solutions with smooth bounded vector fields, yielding estimates of quantiles and p-values. Exploiting the shuffle product, we derive exact formulae for smooth surrogates of conditional value-at-risk (CVaR) in terms of expected signatures, leading to new one-class SVM algorithms optimising smooth CVaR objectives. We then establish lower bounds on type-$\mathrm{II}$ error for alternatives with finite first moment, giving general power bounds when the reference measure and the alternative are absolutely continuous with respect to each other. Finally, we numerically evaluate the type-$\mathrm{I}$ error and statistical power of signature-based test statistics, using synthetic anomalous diffusion data and real-world molecular biology data.
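To make the testing setup in the abstract concrete, here is a minimal sketch of a signature-based novelty test. It is an illustration under stated assumptions, not the authors' implementation: the signature is truncated at level 2 and computed for piecewise-linear paths, the test statistic is the Euclidean distance to an empirical expected signature, the reference measure is discretised Brownian motion, and the p-value is a plain empirical one from a calibration sample rather than one derived from the paper's tail bounds. All function names and the jump perturbation are hypothetical.

```python
import numpy as np


def signature_level2(path):
    """Level-1 and level-2 signature terms of a piecewise-linear path.

    `path` has shape (n_points, dim); the result has length dim + dim**2.
    """
    increments = np.diff(path, axis=0)
    level1 = increments.sum(axis=0)                    # total increment
    offsets = path[:-1] - path[0]                      # X_k - X_0 at each segment start
    # Iterated integral of (X^i - X^i_0) dX^j, summed exactly over linear segments.
    level2 = offsets.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])


def brownian_paths(n_paths, n_steps, dim, rng):
    """Discretised Brownian motion on [0, 1], started at the origin."""
    steps = rng.normal(scale=np.sqrt(1.0 / n_steps), size=(n_paths, n_steps, dim))
    start = np.zeros((n_paths, 1, dim))
    return np.concatenate([start, np.cumsum(steps, axis=1)], axis=1)


def empirical_p_value(score, calibration_scores):
    """Share of calibration scores at least as large as `score` (with +1 correction)."""
    return (1 + np.sum(calibration_scores >= score)) / (1 + len(calibration_scores))


rng = np.random.default_rng(0)
reference = brownian_paths(500, 100, 2, rng)     # used to estimate the expected signature
calibration = brownian_paths(500, 100, 2, rng)   # held out to calibrate the statistic

expected_signature = np.mean([signature_level2(p) for p in reference], axis=0)


def distance_to_expected_signature(path):
    """Test statistic: Euclidean distance to the empirical expected signature."""
    return np.linalg.norm(signature_level2(path) - expected_signature)


calibration_scores = np.array(
    [distance_to_expected_signature(p) for p in calibration]
)

# Hypothetical anomaly (an assumption for illustration): a jump of size 5 at mid-path.
anomalous = brownian_paths(1, 100, 2, rng)[0]
anomalous[50:] += 5.0

p_value = empirical_p_value(
    distance_to_expected_signature(anomalous), calibration_scores
)
print(f"empirical p-value of the perturbed path: {p_value:.4f}")
```

The reference and calibration samples are kept separate so that the calibration scores are not computed on the same paths that defined the expected signature. A higher truncation level or a dedicated signature library could replace `signature_level2` without changing the rest of the sketch.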

Paper Structure

This paper contains 19 sections, 10 theorems, 98 equations, 4 figures.

Key Result

Lemma 3.2

For any real-valued random variable $Z$ and any $\alpha \in [0,1)$, we have

Figures (4)

  • Figure 1: Brownian motion perturbed by a spike. (a) AUROC as a function of the spike intensity for the distance to the expected signature. (b) Multiple-testing false discovery rate and power at level $\alpha=0.1$, comparing empirical p-values (from 1,000 samples) with p-values obtained from a Weibull tail bound fitted using 100,000 samples. (c) Weibull tail-bound fit. (d) Single-hypothesis false positive rate and power at level $\alpha=0.01$.
  • Figure 2: Comparison of different test statistics. Comparison of ocsvm, conformance score, distance to the expected signature, and TAMSD with $\tau\in\{1,2,4,16,512\}$.
  • Figure 3: Modification detection in nanopore reads with ocsvm. Bottom: per-read p-values (with multiple testing correction) at each site; non-significant p-values at level 0.20 are rendered light grey. Top: for each site, the proportion of significant reads. Left: signature features. Right: mean-current and dwell-time features.

Theorems & Definitions (22)

  • Definition 3.1: CVaR
  • Lemma 3.2
  • Theorem 3.3
  • Definition 3.4
  • Definition 3.5
  • Example 1: RDEs with Gaussian drivers
  • Lemma 3.7
  • Corollary 3.8
  • Theorem 3.9: Type-II error
  • Remark 3.10
  • ...and 12 more
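
The statements of Definition 3.1 (CVaR) and Lemma 3.2 are not reproduced in full on this page, so the sketch below does not claim to match them. It only illustrates the standard notion of conditional value-at-risk on an empirical sample, computed in two ways: as the average of the worst $(1-\alpha)$ fraction of outcomes, and through the Rockafellar-Uryasev variational formula $\min_t \{ t + \mathbb{E}[(Z-t)^+]/(1-\alpha) \}$. Whether the paper uses exactly this form is an assumption, and the smooth surrogates derived in the paper are not implemented here. Function names are hypothetical.

```python
import numpy as np


def cvar_tail_average(samples, alpha):
    """Mean of the largest (1 - alpha) fraction of the samples.

    Agrees with the variational formula when n * (1 - alpha) is an integer.
    """
    ordered = np.sort(np.asarray(samples, dtype=float))
    k = max(1, int(round(len(ordered) * (1.0 - alpha))))  # size of the tail
    return ordered[-k:].mean()


def cvar_variational(samples, alpha):
    """Minimum over t of t + E[(Z - t)^+] / (1 - alpha) on the empirical measure.

    The objective is convex and piecewise linear in t with kinks at the sample
    points, so it is enough to evaluate it at t equal to each sample.
    """
    z = np.asarray(samples, dtype=float)
    # excess[i, j] = max(z_j - t_i, 0) with candidate threshold t_i = z_i
    excess = np.maximum(z[None, :] - z[:, None], 0.0)
    objective = z + excess.mean(axis=1) / (1.0 - alpha)
    return objective.min()


scores = np.array([1.0, 2.0, 3.0, 4.0])
alpha = 0.5
tail_value = cvar_tail_average(scores, alpha)
variational_value = cvar_variational(scores, alpha)
print(tail_value, variational_value)
```

The variational form is the one that lends itself to optimisation, since it turns CVaR into a joint minimisation over the threshold $t$ and any model parameters that determine $Z$.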