TTK is Getting MPI-Ready

Eve Le Guillou; Michael Will; Pierre Guillou; Jonas Lukasczyk; Pierre Fortin; Christoph Garth; Julien Tierny

TL;DR

This work extends the Topology ToolKit (TTK) to distributed-memory systems using MPI, enabling scalable topological analysis pipelines on datasets far larger than a single machine can hold. It introduces a distributed triangulation data structure (supporting both triangulated domains and regular grids), a dedicated MPI-based infrastructure for distributed pipelines, and a taxonomy of topological algorithms by their communication needs, accompanied by examples of MPI+thread ports and performance analyses. The authors validate the approach with an integrated pipeline run on a dataset of 120 billion vertices using 64 nodes (1536 cores), showing that the MPI preconditioning overhead is negligible relative to the pipeline time and that strong and weak scaling vary by algorithm and data characteristics. The paper provides a concrete roadmap for completing the MPI extension across TTK's algorithms and makes the distributed implementation available in TTK 1.2.0, enabling broader adoption within ParaView. Overall, the work enables multi-algorithm topological analysis pipelines on distributed HPC resources.

Abstract

This system paper documents the technical foundations for the extension of the Topology ToolKit (TTK) to distributed-memory parallelism with the Message Passing Interface (MPI). While several recent papers introduced topology-based approaches for distributed-memory environments, these reported experiments obtained with tailored, mono-algorithm implementations. In contrast, we describe in this paper a versatile approach (supporting both triangulated domains and regular grids) for the support of topological analysis pipelines, i.e. sequences of topological algorithms interacting together. While developing this extension, we faced several algorithmic and software engineering challenges, which we document in this paper. We describe an MPI extension of TTK's data structure for triangulation representation and traversal, a central component to the global performance and generality of TTK's topological implementations. We also introduce an intermediate interface between TTK and MPI, both at the global pipeline level and at the fine-grain algorithmic level. We provide a taxonomy for the distributed-memory topological algorithms supported by TTK, depending on their communication needs, and provide examples of hybrid MPI+thread parallelizations. Performance analyses show that parallel efficiencies range from 20% to 80% (depending on the algorithms), and that the MPI-specific preconditioning introduced by our framework induces a negligible computation time overhead. We illustrate the new distributed-memory capabilities of TTK with an example of advanced analysis pipeline, combining multiple algorithms, run on the largest publicly available dataset we have found (120 billion vertices) on a cluster with 64 nodes (for a total of 1536 cores). Finally, we provide a roadmap for the completion of TTK's MPI extension, along with generic recommendations for each algorithm communication category.

Paper Structure (54 sections, 15 figures, 1 table)

Introduction
Related work
Contributions
Background
Input data
Critical points
Integral lines
Discrete gradient
Distributed Model
Input distribution formalization
Decomposition
Ghost layer
Global simplex identifiers
Simplex-to-process maps
Output distribution formalization
...and 39 more sections

Figures (15)

Figure 1: Topological objects considered in this paper on a toy example (elevation $f$ on a terrain $\mathcal{M}$, (a)). The vertices of $\mathcal{M}$ can be classified based on their star into regular vertices ((b), top: PL setting, bottom: DMT setting), local minima (c), saddle points (d) or local maxima (e). Integral lines (orange curves, (f)) are curves which are tangential to the gradient of $f$.
Figure 2: The input data (a) is assumed to be loaded in the memory of $n_p$ independent processes in the form of $n_p$ disjoint blocks of data ((b), one color per block, $n_p = 4$ in this example). A layer of ghost simplices ((c), coming from adjacent blocks, matching colors) is added to each block. This local data duplication ((d), transparent) eases subsequent processing on block boundaries. A local adjacency graph is constructed to encode local neighbor relations between blocks (e).
Figure 3: Preconditioning of our distributed explicit triangulation. (a) Each process $i$ enumerates its number $n_{v_i}$ of exclusively owned vertices and $d$-simplices. Next, an MPI prefix sum provides a local offset for each process to generate global identifiers. (b) For each process $i$, simplices of intermediate dimensions (edges ($n_{e_i}$), triangles) are locally enumerated for contiguous intervals of global identifiers of $d$-simplices (white numbers). Next, all the intervals are sent to the process $0$ which sorts them first by simplex-to-process identifier, then by interval start, yielding a per-interval offset that each process can use to generate its global identifiers (black numbers). (c) Within a given block, the vertices at the boundary of the domain $\mathcal{M}$ are identified as non-ghost boundary vertices (large spheres). Next, a simplex which only contains boundary vertices is considered to be a boundary simplex (larger cylinders). (d) The global identifiers and boundary information of the ghost simplices are retrieved through MPI communications with the neighbor processes. The ghost simplices on the global boundary are flagged as boundary simplices (larger spheres and cylinders).
Figure 4: Preconditioning of our distributed implicit triangulation. (a) Each process $i$ computes (with shared-memory parallelism) the bounding box $\mathcal{B}_i$ of its ghosted block $\mathcal{M}_i'$. The vertex $o$, respectively $O$, is the origin of $\mathcal{M}_i'$, respectively $\mathcal{M}$, with $(X'_o, Y'_o, Z'_o)$, respectively $(X'_O, Y'_O, Z'_O)$, its floating-point coordinates. The bounding box $\mathcal{B}$ of $\mathcal{M}$ is computed (via MPI parallel reductions) from all the local $\mathcal{B}_i$. (b) Two key pieces of information are computed at this step: the dimensions of the global grid $(n_X, n_Y, n_Z)$ (the number of vertices of $\mathcal{M}$ in each direction) and the local grid offset $(X_o, Y_o, Z_o)$ (the global discrete coordinates of $o$). The offset is computed from $(X'_O, Y'_O, Z'_O)$, $(X'_o, Y'_o, Z'_o)$ and the floating-point spacing of the grid $(s_x, s_y, s_z)$. Following that, each process locally instantiates a global implicit triangulation model of $\mathcal{M}$. (c) Given a local vertex identifier, its global discrete coordinates $(X, Y, Z)$ in $\mathcal{M}$ are inferred from its local discrete point coordinates $(x, y, z)$ (with $x \in [0, n_x -1]$, $y \in [0, n_y -1]$, and $z \in [0, n_z -1]$, $n_x$, $n_y$ and $n_z$ being the number of vertices of the grid $\mathcal{M}_i'$ in each direction), and its local grid offsets. Next, its global identifier, $\phi_0(v)$, is determined on-the-fly by global row-major indexing.
Figure 5: Preconditioning of our distributed periodic implicit triangulation. This triangulation type is handled similarly to the implicit case, but additional ghost simplices need to be computed. Given a data block $\mathcal{M}_i$ ((a), orange), ParaView generates a first layer of ghost $d$-simplices ((b), blue, grey, yellow). If $\mathcal{M}_i$ is located on the boundary of the global grid $\mathcal{M}$, periodic boundary conditions must be considered by adding an extra layer of ghost $d$-simplices (arrows) for each periodic face of $\mathcal{M}$ (c).
...and 10 more figures
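
The vertex identifier steps described in the captions of Figure 3(a) and Figure 4(b, c) can be illustrated with a minimal serial sketch. It is not TTK's implementation: the function names are hypothetical, the per-process counts are passed as a plain list instead of being exchanged over MPI (in an MPI program, an exclusive scan such as MPI_Exscan would typically provide the offsets), and the indexing convention (X varying fastest) is an assumption, since the caption only says "global row-major indexing".

```python
from itertools import accumulate


def exclusive_prefix_sum(counts):
    """For each rank, sum the counts of all lower ranks (serial stand-in
    for an MPI exclusive scan; the MPI exchange itself is not modeled)."""
    return list(accumulate(counts, initial=0))[:-1]


def explicit_global_ids(counts):
    """Figure 3(a) idea: each rank numbers its exclusively owned vertices
    contiguously, starting at its prefix-sum offset."""
    offsets = exclusive_prefix_sum(counts)
    return [list(range(off, off + n)) for off, n in zip(offsets, counts)]


def local_grid_offset(block_origin, global_origin, spacing):
    """Figure 4(b) idea: discrete offset (X_o, Y_o, Z_o) of a block origin,
    derived from floating-point origins and the grid spacing.
    Rounding guards against floating-point noise (an assumption)."""
    return tuple(
        round((o - big_o) / s)
        for o, big_o, s in zip(block_origin, global_origin, spacing)
    )


def implicit_global_id(local_xyz, grid_offset, global_dims):
    """Figure 4(c) idea: shift local discrete coordinates by the block
    offset, then linearize in the global grid.
    Assumed convention: X varies fastest, then Y, then Z."""
    x, y, z = local_xyz
    x_o, y_o, z_o = grid_offset
    n_x, n_y, _ = global_dims
    gx, gy, gz = x + x_o, y + y_o, z + z_o
    return gx + n_x * (gy + n_y * gz)


if __name__ == "__main__":
    # Three hypothetical ranks owning 3, 2 and 4 vertices.
    print(explicit_global_ids([3, 2, 4]))
    # Hypothetical block whose origin sits at (1.0, 0.0, 2.0) in a grid
    # with origin (0, 0, 0) and spacing 0.5 in every direction.
    offset = local_grid_offset((1.0, 0.0, 2.0), (0.0, 0.0, 0.0), (0.5, 0.5, 0.5))
    print(offset, implicit_global_id((1, 2, 0), offset, (10, 10, 10)))
```

In the explicit case the identifiers must be stored, since they depend on how many vertices the lower ranks own. In the implicit case nothing is stored: the identifier is recomputed on demand from the local coordinates, the block offset and the global grid dimensions.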
