Massively Parallel Computation of Similarity Matrices from Piecewise Constant Invariants

Björn H. Wehlin

TL;DR

The paper addresses scalable computation of similarity and inner-product matrices for large collections of piecewise constant functions (PCFs). Its core technique, rectangle iteration, walks a pair of PCFs directly instead of resampling them onto a fixed grid, which gives a linear-time, allocation-free computation. The framework formalizes PCFs together with combinations, reductions, and functionals; from these it builds integrated combination matrices and PCF integrals, which yield distance and Gram matrices, including time-dependent and weighted variants. Key contributions are a reduction-tree approach that reuses memory through reduction accumulators, the GPU-accelerated masspcf implementation, and support for multidimensional PCF arrays. The implementation scales to multi-GPU hardware (e.g., 500k PCFs across 8 GPUs in about 423 seconds), enabling large-scale PCF-based analyses in fields such as topological data analysis (TDA) and computational statistics, with pairwise similarity computations at machine precision.

Abstract

We present a computational framework for piecewise constant functions (PCFs) and use it for several types of computations that are useful in statistics, such as averages and similarity matrices. We give a linear-time, allocation-free algorithm for working with pairs of PCFs at machine precision. From this, we derive algorithms for computing reductions of several PCFs. The algorithms have been implemented in a highly scalable fashion for parallel execution on CPU and, in some cases, (multi-)GPU, and are provided in a Python package. In addition, we provide support for multidimensional arrays of PCFs and vectorized operations on these. As a stress test, we have computed a distance matrix from 500,000 PCFs using 8 GPUs.
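The pairwise algorithm mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the masspcf implementation or its API. The representation (a list of `(start_time, value)` pairs beginning at time 0, with the last value held until the upper limit), the function name `integrate_combination`, and the finite upper limit `t_max` are all assumptions made for illustration. The sketch walks both functions with one pointer each, visits every rectangle on which both are constant, and accumulates the integral of a pointwise combination.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (start_time, value) pairs sorted by time,
# with the first start time equal to 0.0.
PCF = List[Tuple[float, float]]


def integrate_combination(
    f: PCF, g: PCF, op: Callable[[float, float], float], t_max: float
) -> float:
    """Integrate op(f(t), g(t)) over [0, t_max] with a two-pointer walk."""
    i = j = 0          # pointers into f and g
    t = 0.0            # current position of the scanline
    total = 0.0
    while t < t_max:
        # Next change point of each function (infinity if there is none).
        f_next = f[i + 1][0] if i + 1 < len(f) else float("inf")
        g_next = g[j + 1][0] if j + 1 < len(g) else float("inf")
        t_next = min(f_next, g_next, t_max)
        # Both functions are constant on [t, t_next): one rectangle.
        total += (t_next - t) * op(f[i][1], g[j][1])
        # Move whichever pointer(s) change at t_next.
        if f_next <= t_next:
            i += 1
        if g_next <= t_next:
            j += 1
        t = t_next
    return total


f = [(0.0, 1.0), (2.0, 3.0)]
g = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0)]
l1_distance = integrate_combination(f, g, lambda a, b: abs(a - b), 4.0)
inner_product = integrate_combination(f, g, lambda a, b: a * b, 4.0)
```

Each pointer only moves forward, so the number of loop iterations is bounded by the total number of change points, and the only state kept is two indices, the scanline position, and the running total. Calling the function on every pair in a collection would fill a distance or Gram matrix entry by entry.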

Paper Structure (16 sections, 11 equations, 7 figures, 4 algorithms)

Introduction
Piecewise constant functions
Combinations and reductions
Functionals
Generalized induced maps
Integration of PCF combinations
Integrated combination matrices
PCF integrals
Reductions
Computing reduction pairs
User guide
Multidimensional arrays
Matrix computations
Implementation details and performance
Future work
...and 1 more section

Figures (7)

Figure 1: Three steps of rectangle iteration. In the first step, the pointer belonging to $f$ (function drawn in blue) moves since a change in $f$ occurs before $g$. This is also true for the second step, but in the third step, $g$ now changes before $f$, so the second pointer moves. The scanline, shown as a dashed line, moves together with the last pointer shift and determines the current time point. The (transposed) matrix representations of $f$ and $g$ are shown below the plot, together with the respective pointers.
Figure 2: Example reduction tree on eight PCFs $f_1,\ldots,f_8$. Here, we split the PCFs in groups of at most three PCFs at the leaf level of the tree, although other configurations are possible. Typically, we will use binary reduction trees. One of the reductions (marked as No-Op) is inserted to make the programming a bit easier but could also be left out ($f_7 \oplus f_8$ would then have to feed directly into the last reduction on the left). Each column in the tree uses one reduction accumulator and at the end of the reduction we return the value stored in the leftmost column.
Figure 3: Speedup on the synthetic dataset, running on (a) multiple CPUs, and (b) multiple GPUs. Times are averaged over 10 runs. The dataset is regenerated with a new random seed for each run. In addition, we plot fitted Amdahl curves and display the estimated parallel portion of the program as a percentage. We used 2,500 PCFs for the CPU side and 50,000 PCFs when running on GPU.
Figure 4: Benchmarks using different floating-point precision (32-bit vs. 64-bit) on CPU/GPU. We display wall-clock running times (a) and speedups (b) for different numbers of PCFs. For $M<500$ PCFs, we display the mean of 10 runs for each $M$, and for $M \geq 500$, we instead use the mean over 3 runs. We see that for a small number of PCFs, the CPU implementation is faster, so the implementation automatically switches to the CPU for small datasets. We also benchmarked using 32-bit floats on CPU, but as expected the performance is nearly identical to the 64-bit case, so we do not include this benchmark in the figures. On GPU, using lower precision is significantly faster.
Figure 5: A representative sample of three PCFs generated from the synthetic data generation procedure described in the appendix. Here, $f_1$, $f_2$ and $f_3$ have 979, 217 and 834 time points, respectively.
...and 2 more figures
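
The reduction tree described in the caption of Figure 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not the paper's algorithm: it allocates a new list for every intermediate result instead of reusing reduction accumulators, and the names `combine` and `tree_reduce` as well as the `(start_time, value)` list representation are assumptions. It uses a binary tree, and a leftover PCF at an odd-sized level is passed through unchanged, similar in spirit to the No-Op node in the figure.

```python
from typing import Callable, List, Tuple

# Hypothetical representation: (start_time, value) pairs sorted by time,
# with the first start time equal to 0.0.
PCF = List[Tuple[float, float]]
INF = float("inf")


def combine(f: PCF, g: PCF, op: Callable[[float, float], float]) -> PCF:
    """Pointwise combination of two PCFs, returned as a new PCF."""
    i = j = 0
    out: PCF = []
    while True:
        # The current piece starts at the later of the two pointer times.
        out.append((max(f[i][0], g[j][0]), op(f[i][1], g[j][1])))
        f_next = f[i + 1][0] if i + 1 < len(f) else INF
        g_next = g[j + 1][0] if j + 1 < len(g) else INF
        t_next = min(f_next, g_next)
        if t_next == INF:      # no change points left in either function
            return out
        if f_next == t_next:
            i += 1
        if g_next == t_next:
            j += 1


def tree_reduce(pcfs: List[PCF], op: Callable[[float, float], float]) -> PCF:
    """Reduce a non-empty list of PCFs with a binary reduction tree."""
    level = list(pcfs)
    while len(level) > 1:
        # Pairs within one level are independent of each other.
        nxt = [combine(level[k], level[k + 1], op)
               for k in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])   # odd one out passes through unchanged
        level = nxt
    return level[0]


pcfs = [
    [(0.0, 1.0)],
    [(0.0, 2.0), (1.0, 0.0)],
    [(0.0, 0.0), (2.0, 5.0)],
]
total = tree_reduce(pcfs, lambda a, b: a + b)
```

Because the pairs within a level do not depend on one another, each level could be processed in parallel; dividing the values of `total` by `len(pcfs)` would give the pointwise average of the inputs.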
