A Distributed Approach for Persistent Homology Computation on a Large Scale

Riccardo Ceccaroni, Lorenzo Di Rocco, Umberto Ferraro Petrillo, Pierpaolo Brutti

TL;DR

This work introduces PixHomology, a memory-efficient algorithm for computing $0$-dimensional persistent homology on large 2D images, and couples it with a Spark-based distributed pipeline to process massive batches of images. By avoiding adjacency matrices and leveraging max-pooling operations, PixHomology achieves substantial reductions in memory usage and competitive runtimes, especially on large-scale data. The authors conduct extensive experiments against Ripser and DIPHA across a 24-node cluster with a 36 GB image dataset, demonstrating superior scalability and efficiency for large images. The proposed approach enables scalable topological analysis in high-throughput imaging domains such as astronomy and biology, and outlines future enhancements like out-of-core processing and distributed per-image partitions. Overall, PixHomology offers a practical, scalable solution for persistent homology on big image datasets with clear advantages in memory footprint and parallel throughput.

Abstract

Persistent homology (PH) is a powerful mathematical method to automatically extract relevant insights from images, such as those obtained by high-resolution imaging devices like electron microscopes or new-generation telescopes. However, applying this method comes at a very high computational cost, which is bound to grow further as new imaging devices generate an ever-growing amount of data. In this paper we present PixHomology, a novel algorithm for efficiently computing $0$-dimensional PH on 2D images, optimizing memory and processing time. By leveraging the Apache Spark framework, we also present a distributed version of our algorithm with several optimized variants, able to concurrently process large batches of astronomical images. Finally, we present the results of an experimental analysis showing that our algorithm and its distributed version are efficient in terms of required memory, execution time, and scalability, consistently outperforming existing state-of-the-art PH computation tools when used to process large datasets.

Paper Structure

This paper contains 26 sections, 1 equation, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: A point $(x,y)$ in the persistence diagram (PD) indicates a topological feature of dimension $0$ ($H_0$) that is born at $x$ and persists until $y$. We call $x$ the $p_{birth}$ and $y$ the $p_{death}$. By definition, all points should lie above the diagonal. The horizontal dashed line represents infinity.
  • Figure 2: Apache Spark architecture. Example for a reference installation featuring two worker nodes and one driver application. Each worker node in this figure runs one executor process and two tasks. The overall distributed execution is orchestrated by a cluster manager.
  • Figure 3: The PH calculation using PixHomology on an image containing three components defined by Gaussian functions. Initially, each pixel is linked to its neighbor with the highest value, and PixHomology detects relative maxima as birth values. Subsequently, all the minimum or saddle points are located. The first value of these points that connects the two components represents the death value of the component with the lower birth value. Finally, the process ends with identifying the absolute minimum in the image, which serves as the ultimate death point associated with the component relative to the absolute maximum.
  • Figure 4: An overview of the distributed workflow of PixHomology on a Spark cluster involving four executor processes scattered across two computing nodes. After partitioning the URLs of the images across the various executors, each executor performs two map operations. The former operation loads the image into memory, while the latter performs the PixHomology algorithm to compute the $0$-dimensional PH.
  • Figure 5: A cropped section of $500 \times 500$ pixels from an image in the dataset.
  • ...and 6 more figures
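
The birth and death values described in the captions of Figures 1 and 3 can be illustrated with a small sketch. It is not the authors' PixHomology implementation, which is built on max-pooling and is not reproduced here. Instead it uses the standard union-find formulation of $0$-dimensional PH on a superlevel filtration: pixels are added from brightest to darkest, each local maximum starts a component, and when two components meet, the one with the lower birth value dies at the value of the connecting pixel. The function name `superlevel_h0` and the choice of 8-connectivity are assumptions made for illustration only.

```python
def superlevel_h0(image):
    """Illustrative (birth, death) pairs of 0-dimensional PH for a 2D image.

    Hypothetical helper, not the PixHomology code: a textbook union-find
    sweep over pixels sorted from brightest to darkest.
    """
    rows, cols = len(image), len(image[0])
    flat = [float(v) for row in image for v in row]
    # Visit pixels in order of decreasing intensity (stable for ties).
    order = sorted(range(len(flat)), key=lambda i: -flat[i])
    parent = [-1] * len(flat)   # -1 marks a pixel not yet added
    birth = [0.0] * len(flat)   # birth value, meaningful at component roots

    def find(i):
        # Locate the root, then compress the path behind it.
        root = i
        while parent[root] != root:
            root = parent[root]
        while parent[i] != root:
            parent[i], i = root, parent[i]
        return root

    # Assumption: pixels are connected through their 8 neighbours.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    pairs = []
    for idx in order:
        r, c = divmod(idx, cols)
        parent[idx] = idx
        birth[idx] = flat[idx]
        for dr, dc in offsets:
            nr, nc = r + dr, c + dc
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            n = nr * cols + nc
            if parent[n] == -1:
                continue
            a, b = find(idx), find(n)
            if a == b:
                continue
            # The component with the lower birth value dies at this pixel.
            if birth[a] < birth[b]:
                a, b = b, a
            if birth[b] > flat[idx]:  # skip zero-persistence pairs
                pairs.append((birth[b], flat[idx]))
            parent[b] = a
    # The component of the absolute maximum dies at the absolute minimum.
    pairs.append((flat[order[0]], min(flat)))
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    # Three peaks (8, 5, 3) separated by valleys (2 and 1).
    print(superlevel_h0([[0, 5, 2, 8, 1, 3, 0]]))
```

On the one-row example, the peaks with values 5 and 3 die at the valleys 2 and 1 that connect them to the brighter component, while the peak with value 8 persists down to the image minimum. Each returned pair corresponds to one point $(p_{birth}, p_{death})$ of the persistence diagram.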