
Novelty Detection on Radio Astronomy Data using Signatures

Paola Arrubarrena, Maud Lemercier, Bojan Nikolic, Terry Lyons, Thomas Cass

TL;DR

SigNova is a new semi-supervised framework for detecting anomalies in streamed data. Its complexity depends on the RFI pattern rather than on the size of the observation window, and it improves the detection of various types of RFI in time-frequency visibility data.

Abstract

We introduce SigNova, a new semi-supervised framework for detecting anomalies in streamed data. Although our initial examples focus on detecting radio-frequency interference (RFI) in digitized signals in radio astronomy, SigNova applies to any type of streamed data. The framework comprises three primary components. First, we use the signature transform to extract a canonical collection of summary statistics from observational sequences, which lets us represent variable-length visibility samples as finite-dimensional feature vectors. Second, each feature vector is assigned a novelty score, calculated as the Mahalanobis distance to its nearest neighbor in an RFI-free training set. By thresholding these scores, we identify observation ranges that deviate from the expected behavior of RFI-free visibility samples without relying on stringent distributional assumptions. Third, we integrate this anomaly detector with Pysegments, a segmentation algorithm, to localize consecutive observations contaminated with RFI, if any. This approach provides a compelling alternative to the classical windowing techniques commonly used for RFI detection. Importantly, the complexity of our algorithm depends on the RFI pattern rather than on the size of the observation window. We demonstrate how SigNova improves the detection of various types of RFI (e.g., broadband and narrowband) in time-frequency visibility data. We validate our framework on data from the Murchison Widefield Array (MWA) telescope, as well as on simulated data and the Hydrogen Epoch of Reionization Array (HERA).
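The first two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the signature is truncated at level 2 for brevity, the covariance is estimated from the corpus and inverted with a pseudo-inverse (an assumption standing in for the paper's distance), and the function names `signature_level2` and `novelty_scores`, the synthetic streams, and the 0.99 calibration quantile are all hypothetical choices made for illustration.

```python
import numpy as np


def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path has shape (n_points, d); the result has length d + d*d,
    whatever the number of points.
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)            # increment of each linear piece
    level1 = inc.sum(axis=0)               # total displacement
    before = np.cumsum(inc, axis=0) - inc  # displacement accumulated before each piece
    # Iterated integral: sum_k (before_k (x) inc_k + 0.5 * inc_k (x) inc_k)
    level2 = before.T @ inc + 0.5 * (inc.T @ inc)
    return np.concatenate([level1, level2.ravel()])


def novelty_scores(corpus, queries):
    """Mahalanobis distance from each query to its nearest corpus neighbour.

    Assumption: the metric is the pseudo-inverse of the corpus covariance.
    """
    corpus = np.asarray(corpus, dtype=float)
    queries = np.asarray(queries, dtype=float)
    precision = np.linalg.pinv(np.atleast_2d(np.cov(corpus, rowvar=False)))
    diff = queries[:, None, :] - corpus[None, :, :]       # (n_query, n_corpus, dim)
    sq = np.einsum("qcd,de,qce->qc", diff, precision, diff)
    return np.sqrt(np.maximum(sq, 0.0).min(axis=1))


# Toy demo on synthetic two-channel streams (hypothetical data, not telescope data).
rng = np.random.default_rng(0)
lengths = rng.integers(50, 100, size=200)
streams = [np.cumsum(rng.normal(size=(n, 2)), axis=0) / np.sqrt(n) for n in lengths]
features = np.array([signature_level2(p) for p in streams])  # same width for every length

corpus, calibration = features[:150], features[150:]
# Threshold taken from held-out clean streams; the quantile is an arbitrary choice.
threshold = np.quantile(novelty_scores(corpus, calibration), 0.99)

burst = streams[0].copy()
burst[20:30] += 25.0  # crude stand-in for an interference burst
is_flagged = novelty_scores(corpus, signature_level2(burst)[None, :])[0] > threshold
```

Because every stream maps to a vector of the same length, streams of different durations can be compared directly, which is the property the abstract relies on. A dedicated signature library would be the natural choice for higher truncation levels.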

Paper Structure (26 sections, 26 equations, 19 figures, 2 tables, 1 algorithm)


Figures (19)

  • Figure 1: Schematic of SigNova. Panel (A) represents the training dataset (corpus). It corresponds to visibility data labeled as "clean". Each datum is itself an ensemble of $N_A$ streams whose expected signature can be queried on any time interval. Panel (B) illustrates how the RFI-detection framework operates on new visibility data, that is, a new ensemble of $N_A$ streams (associated with one antenna in a frequency channel). The segmentation algorithm determines dynamically on which interval $[s,t]$ one should analyse the signal, that is, test whether it is RFI-free. At every step, the analysis consists of computing an anomaly score on $[s,t]$. This score is obtained by computing the minimum of the (Mahalanobis) distances between the new data and every datum in the corpus (Panel (A)). Panel (C) shows the output of the framework: a collection of disjoint intervals which have been marked as "clean", that is, RFI-free. Based on this output, one can determine the RFI localisation.
  • Figure 2: A UMAP (McInnes et al., 2018) representation of the expected signature for each antenna in the corpus, calibration, and test sets. The dataset dimensions are denoted as (N_Ant, Sig), with an example size for the corpus being (213, 62) due to truncation of the signature at level 5. UMAP projects this into a lower dimension of (213, 2). The top plot illustrates a test set without RFI, while the bottom one depicts a test set with high RFI contamination. This example uses simulated data, with intentionally high RFI for explanatory purposes.
  • Figure 3: CASA simulations with different types of RFI contaminating only antenna 1. The ground truth, illustrating the amplitude difference, is depicted in the plot on the right. The subsequent plots feature SSINS, AOFlagger, and SigNova, respectively.
  • Figure 4: CASA simulations with different types of RFI contaminating only baseline 1 (antenna 1 and antenna 2). The ground truth, illustrating the amplitude difference, is depicted in the plot on the right. The subsequent plots feature SSINS, AOFlagger, and SigNova, respectively.
  • Figure 5: Narrowband MWA data example. The SSINS results are shown on the left, the AOFlagger results in the center, and the SigNova results on the right.
  • ...and 14 more figures
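
The interval search described in the Figure 1 caption (Panels (B) and (C)) can be sketched as a recursive bisection. This is a toy illustration under stated assumptions, not the Pysegments API: `clean_intervals`, `merge_adjacent`, `flagged_ranges`, and the `is_clean` predicate are hypothetical names, and in practice `is_clean(s, t)` would compare an anomaly score on $[s,t]$ against a threshold.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, stop) in sample indices


def clean_intervals(start: int, stop: int,
                    is_clean: Callable[[int, int], bool],
                    min_len: int = 1) -> List[Interval]:
    """Return disjoint sub-intervals of [start, stop) that pass the test."""
    if stop <= start:
        return []
    if is_clean(start, stop):
        return [(start, stop)]   # the whole interval passes: no need to look inside
    if stop - start <= min_len:
        return []                # smallest resolvable interval still fails
    mid = (start + stop) // 2
    return (clean_intervals(start, mid, is_clean, min_len)
            + clean_intervals(mid, stop, is_clean, min_len))


def merge_adjacent(intervals: List[Interval]) -> List[Interval]:
    """Fuse intervals that touch, e.g. (0, 4) and (4, 5) become (0, 5)."""
    merged: List[Interval] = []
    for s, t in sorted(intervals):
        if merged and merged[-1][1] == s:
            merged[-1] = (merged[-1][0], t)
        else:
            merged.append((s, t))
    return merged


def flagged_ranges(clean: List[Interval], start: int, stop: int) -> List[Interval]:
    """Complement of the clean intervals inside [start, stop)."""
    gaps, cursor = [], start
    for s, t in clean:
        if s > cursor:
            gaps.append((cursor, s))
        cursor = t
    if cursor < stop:
        gaps.append((cursor, stop))
    return gaps


# Toy usage: samples 5 and 6 are contaminated (made-up ground truth).
contaminated = {5, 6}
toy_test = lambda s, t: not any(s <= i < t for i in contaminated)
clean = merge_adjacent(clean_intervals(0, 16, toy_test))
print(clean)                         # [(0, 5), (7, 16)]
print(flagged_ranges(clean, 0, 16))  # [(5, 7)]
```

In this toy version, a window with no contamination is accepted after a single test regardless of its length, and extra tests are spent only around contaminated samples. That is one way to read the abstract's statement that the cost depends on the RFI pattern rather than on the window size.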

Theorems & Definitions (6)

  • Definition 2.1: Signature
  • Example 1
  • Definition 2.2: $\mu$-variance norm
  • Definition 2.3: $\mu$-variance distance
  • Remark 2.4
  • Remark 3.1