Goodness-of-Fit and Clustering of Spherical Data: the QuadratiK package in R and Python

Giovanni Saraceno, Marianthi Markatou, Raktim Mukhopadhyay, Mojgan Golzy

TL;DR

QuadratiK addresses goodness-of-fit (GoF) testing and clustering for high-dimensional and spherical data by unifying kernel-based quadratic distances with diffusion and Poisson kernels. The approach uses centered diffusion kernels to build test statistics based on the quadratic distance $d_K(F,G)$ for one-, two-, and k-sample problems, complemented by a Poisson-kernel-based test for uniformity on the sphere and an EM-style clustering algorithm built on Poisson kernel-based densities (PKBD). Key contributions include efficient, parallelized GoF procedures with bandwidth selection via mid-power analysis, real-data demonstrations on public datasets, and a spherical clustering framework with visualization and validation tools. The R and Python implementations come with guidance on kernel selection, critical-value computation, and cluster-number determination.
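To make the idea of a kernel-based quadratic distance concrete, the following is a minimal NumPy sketch of an empirical two-sample statistic estimating $d_K(F,G) = \iint K(x,y)\,d(F-G)(x)\,d(F-G)(y)$. It is an illustration under stated assumptions, not the QuadratiK API: the function names `gaussian_kernel` and `two_sample_quadratic_distance` are hypothetical, the kernel is an unnormalized Gaussian with bandwidth `h`, and the sketch omits the kernel centering mentioned above as well as the package's critical-value computation.

```python
import numpy as np


def gaussian_kernel(X, Y, h):
    """Kernel matrix exp(-||x - y||^2 / (2 h^2)); unnormalized form, assumed for illustration."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * h ** 2))


def two_sample_quadratic_distance(X, Y, h):
    """U-statistic style estimate of the quadratic distance between two samples.

    Hypothetical helper: within-sample terms exclude the diagonal (i == j),
    the cross term averages over all pairs (x_i, y_j).
    """
    n, m = X.shape[0], Y.shape[0]
    Kxx = gaussian_kernel(X, X, h)
    Kyy = gaussian_kernel(Y, Y, h)
    Kxy = gaussian_kernel(X, Y, h)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = Kxy.mean()
    return term_xx + term_yy - 2.0 * term_xy


# Simulated data: one pair of samples from the same law, one pair with a mean shift.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
Y_same = rng.normal(size=(300, 2))
Y_shift = rng.normal(size=(300, 2)) + 1.0

print("same distribution:   ", two_sample_quadratic_distance(X, Y_same, h=1.0))
print("shifted distribution:", two_sample_quadratic_distance(X, Y_shift, h=1.0))
```

The statistic should stay close to zero when both samples come from the same distribution and grow when they differ. Turning it into a test requires a reference distribution for the statistic, which is where critical-value computation and the choice of `h` come in.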

Abstract

We introduce the QuadratiK package, which incorporates innovative data analysis methodologies. The software, implemented in both R and Python, offers a comprehensive set of goodness-of-fit tests and clustering techniques using kernel-based quadratic distances, thereby bridging the gap between the statistical and machine learning literatures. It implements one-, two-, and k-sample tests for goodness of fit, providing an efficient and mathematically sound way to assess the fit of probability distributions. It also supports tests for uniformity on the d-dimensional sphere based on Poisson kernel densities. Particularly noteworthy is a clustering algorithm specifically tailored for spherical data that leverages a mixture of Poisson kernel-based densities on the sphere. In addition, the software includes graphical functions that help users validate, visualize, and represent clustering results, enhancing the interpretability and usability of the analysis. In summary, our R and Python packages serve as a powerful suite of tools, offering researchers and practitioners the means to delve deeper into their data, draw robust inference, and conduct potentially impactful analyses across a wide array of disciplines.
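As a rough illustration of how a Poisson kernel can be used to test uniformity on the sphere, the sketch below evaluates the Poisson kernel $K_\rho(x,y) = (1-\rho^2)/(1+\rho^2-2\rho\,x^\top y)^{d/2}$ for unit vectors in $\mathbb{R}^d$ and averages its centered version over distinct pairs. It rests on assumptions not stated in the text: the kernel is taken as a density with respect to the uniform probability measure (so centering under uniformity amounts to subtracting 1), the statistic is left unnormalized, and the names `poisson_kernel`, `uniformity_u_stat`, and `sample_sphere` are hypothetical rather than functions of the package.

```python
import numpy as np


def poisson_kernel(X, Y, rho):
    """Poisson kernel between rows of X and Y (unit vectors in R^d), 0 < rho < 1.

    Assumed form: density with respect to the uniform probability measure on the sphere.
    """
    d = X.shape[1]
    cos = np.clip(X @ Y.T, -1.0, 1.0)
    return (1.0 - rho ** 2) / (1.0 + rho ** 2 - 2.0 * rho * cos) ** (d / 2.0)


def uniformity_u_stat(X, rho):
    """Average of the centered kernel K_rho - 1 over all pairs i != j (unnormalized sketch)."""
    n = X.shape[0]
    K_cen = poisson_kernel(X, X, rho) - 1.0
    off_diag_sum = K_cen.sum() - np.trace(K_cen)
    return off_diag_sum / (n * (n - 1))


def sample_sphere(n, d, rng, center=None, spread=0.2):
    """Uniform points on the sphere, or points concentrated around `center` if given."""
    Z = rng.normal(size=(n, d))
    if center is not None:
        Z = spread * Z + np.asarray(center, dtype=float)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)


rng = np.random.default_rng(0)
uniform_pts = sample_sphere(500, 3, rng)
clustered_pts = sample_sphere(500, 3, rng, center=[0.0, 0.0, 1.0])

print("uniform sample:  ", uniformity_u_stat(uniform_pts, rho=0.5))
print("clustered sample:", uniformity_u_stat(clustered_pts, rho=0.5))
```

Under uniformity the centered kernel has mean zero, so the statistic should hover near zero, while data concentrated around a direction push it upward. The concentration parameter `rho` plays a role similar to a bandwidth.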

Paper Structure

This paper contains 17 sections, 29 equations, 9 figures, 3 algorithms.

Figures (9)

  • Figure 1: The classes, along with their corresponding methods and functions, available in the current version of QuadratiK in R. The folder (blue) in the center represents the QuadratiK package. The double rectangles (yellow) depict the classes, and the rectangles (green) denote the methods associated with these classes. The rounded rectangles (rose-colored) represent the functions, and the parallelograms (lavender-colored) show the datasets that are available in the package.
  • Figure 2: The main modules, along with their classes, methods, and functions, available in the current version of QuadratiK in Python. The folder (blue) in the center represents the QuadratiK package. The folders (lavender-colored) depict the various modules within QuadratiK. The classes within these modules are depicted by double rectangles (yellow), and the methods associated with these classes are shown as rectangles (green). Additionally, the package includes various utility functions, represented by rounded rectangles (rose-colored), which are distributed across the modules.
  • Figure 3: Figure automatically generated by the summary function on the result of the normality test. It displays the normal qq-plots (left) with a table of the standard descriptive statistics (right) for each variable.
  • Figure 4: Figure automatically generated by the summary function on the result of the two-sample test. It displays the qq-plots between the two samples (left) with a table of the standard descriptive statistics for each variable (right), computed per group and overall.
  • Figure 5: Plot generated by the select_h function, showing the result of the algorithm for selecting $h$ applied to the $k=3$ two-dimensional samples, each of size $n=200$, in Example 3.2. It displays the obtained power versus the considered values of $h$, for each value of the skewness alternative $\delta$.
  • ...and 4 more figures

Theorems & Definitions (6)

  • Example 3.1: Test for normality
  • Example 3.2: $k$-sample test
  • Example 3.3: Non-parametric two-sample test
  • Example 3.4: $k$-sample test -- Continued
  • Example 3.5: Two-sample test -- Continued
  • Example 3.6: Uniformity test on the Sphere