HiSAXy: A fast methodology for solar wind structure identification in millions of time series

Hala Lamdouar, Sairam Sundaresan, Anna Jungbluth, Sudeshna Boro Saikia, Amanda Joy Camarata, Nathan Miles, Marcella Scoczynski, Mavis Stone, Andrés Muñoz-Jaramillo, Ayris Narock, Adam Szabo

TL;DR

The paper addresses the challenge of scalable, unsupervised identification of frequently occurring magnetic structures in the interplanetary magnetic field (IMF) carried by the solar wind. It introduces HiSAXy, a hybrid clustering approach that combines the indexable iSAX time-series representation with HDBSCAN to enable fast indexing and robust clustering of millions of IMF segments. Empirical results show that HiSAXy identifies larger, coherent clusters while maintaining intra-cluster self-similarity, and that it significantly reduces the human effort required to label discontinuities, with reported time savings on the order of hundreds of hours. This work enables scalable discovery and interpretation of solar wind structures in large Parker Solar Probe (PSP) datasets and is poised to support analyses across multiple timescales and solar wind properties.

Abstract

We present a hybridized unsupervised clustering algorithm, HiSAXy, as a novel way to identify frequently occurring magnetic structures embedded in the interplanetary magnetic field (IMF) carried by the solar wind. The HiSAXy algorithm combines indexable Symbolic Aggregate approXimation (iSAX) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) to efficiently identify clusters of patterns embedded in time series data. We used HiSAXy to identify small-scale structures, known as discontinuities, embedded in time series measurements of the IMF. In doing so, we demonstrate the capability of the algorithm to significantly reduce the number of human analysis hours required to identify these structures, while maintaining a high degree of self-similarity within a given cluster of time series data.

Paper Structure

This paper contains 5 sections, 1 equation, 3 figures.

Figures (3)

  • Figure 1: a) The original calibrated, level 2 magnetometer data from the PSP/FIELDS instrument for the year 2020. b) 30-minute interval of the magnetic field data after applying radial scaling. c) The same 30-minute interval after detrending using a rolling mean over a 1.5 hr interval, smoothing using a rolling mean over a 1 second interval, and linearly interpolating to a uniform cadence of 1 Hz. d) Zoom in to the 5-minute interval highlighted by the red shading in the top left. e) Zoom in to the 5-minute interval highlighted by the red shading in the top right. f) Piecewise aggregate approximation of the 5-minute interval using three 100-second intervals.
  • Figure 2: iSAX tree for indexing sequences using 3-letter words. Tree depth is represented using different colors. iSAX increases the depth of the tree by escalating the cardinality of one of the letters in each node split. A black highlight denotes the tree node containing the sample time series at each tree level. An iSAX letter is composed of two numbers separated by a dot, range (r) and cardinality (c), written r.c. The sample time series is represented by the iSAX words [1.2 0.2 0.2] (L1), [3.4 0.2 0.2] (L2), and [3.4 0.2 0.4] (L3). Shaded areas indicate the range of each letter for the sample time series at a given cardinality and can be used to approximate the Euclidean distance between any two time series in the tree, enabling the clustering of tree nodes using HDBSCAN. The sample sequence is the same one shown in Figs. 1e and 1f. Tick marks and labels have been removed for simplicity.
  • Figure 3: a) iSAX (circles) can be used as an approximate clustering algorithm by extracting all the nodes at a specified tree depth (numbers inside markers) and treating them as clusters. A deeper (larger) tree level has more specific clusters with smaller average differences between members (bottom of y-axis). However, each cluster also has fewer segments, so human analysis still requires significant effort (left of x-axis). Coupling iSAX with HDBSCAN (HiSAXy; squares) combines similar nodes into clusters, reducing the number of clusters that need to be analyzed without significantly impacting cluster specificity, which results in a significant saving of human effort. HDBSCAN alone (star) produces even more clusters than iSAX alone, which is undesirable for our application. b) We find that combining iSAX and HDBSCAN for nodes of tree level 8 is a very good compromise between specificity and human effort. This approach saves more than 700 hours of human effort compared to working with the original segments.
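
The captions of Figures 1f and 2 describe two steps that are easier to follow with a small worked example: piecewise aggregate approximation (PAA) of a 5-minute, 1 Hz segment into three 100-second means, and the encoding of those means as an iSAX word whose letters are written r.c. The sketch below is illustrative only and is not the authors' implementation. It assumes the segment is already scaled to roughly unit variance, that breakpoints are the equiprobable Gaussian quantiles used in classic SAX, and that ranges are numbered from 0 at the bottom. The function names (`paa`, `breakpoints`, `isax_word`, `format_word`) are hypothetical, and the segment is synthetic, chosen so that the output reproduces the three example words quoted in the Figure 2 caption.

```python
from bisect import bisect_right
from statistics import NormalDist, mean


def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each equal-length chunk."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    width = len(series) // n_segments
    return [mean(series[i * width:(i + 1) * width]) for i in range(n_segments)]


def breakpoints(cardinality):
    """Equiprobable breakpoints of a standard normal (assumed, as in classic SAX)."""
    normal = NormalDist()
    return [normal.inv_cdf(k / cardinality) for k in range(1, cardinality)]


def isax_word(paa_values, cardinalities):
    """Encode each PAA value as a (range, cardinality) letter.

    The range is the index of the interval between breakpoints that contains
    the value, counted from 0 at the bottom (an assumed convention).
    """
    return [
        (bisect_right(breakpoints(c), value), c)
        for value, c in zip(paa_values, cardinalities)
    ]


def format_word(word):
    """Render a word in the r.c notation used in the Figure 2 caption."""
    return "[" + " ".join(f"{r}.{c}" for r, c in word) + "]"


# Synthetic 5-minute segment at 1 Hz (300 samples); not real PSP data.
segment = [1.0] * 100 + [-0.2] * 100 + [-0.8] * 100
means = paa(segment, 3)  # three 100-second averages

# Each tree level doubles the cardinality of one letter (a node split).
level_1 = isax_word(means, [2, 2, 2])
level_2 = isax_word(means, [4, 2, 2])
level_3 = isax_word(means, [4, 2, 4])

for name, word in (("L1", level_1), ("L2", level_2), ("L3", level_3)):
    print(name, format_word(word))
```

Because the breakpoints at cardinality c are a subset of those at cardinality 2c, a letter's range at the coarser level is its finer range integer-divided by 2 (here 3 at cardinality 4 maps back to 1 at cardinality 2). This nesting is what lets an iSAX tree split a node without re-reading the raw time series.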