Parallel CPU- and GPU-based connected component algorithms for event building for hybrid pixel detectors

Tomáš Čelko, František Mráz, Benedikt Bergmann, Petr Mánek

TL;DR

The paper tackles real-time clustering for high-rate Timepix detectors by implementing parallel CPU and GPU strategies for online event building. It presents a parallel connected component labeling approach based on a union-find data structure optimized for zero-suppression data, achieving up to $3\times 10^{8}$ hits/s on GPUs and offering two orders of magnitude speedup over CPU methods. The CPU solution combines step-based and data-based parallelization to mitigate I/O bottlenecks and cluster-merging challenges, while the GPU solution avoids conventional frame-based matrix clustering by focusing on non-zero hits and leveraging memory- and access-optimized union-find techniques. Experimental results on diverse datasets show high clustering fidelity (IoU > $99.99\%$ in most cases) and demonstrate the practical viability of real-time clustering for high-rate hybrid pixel detectors, enabling significant data reduction and faster online analysis. Overall, the work substantiates real-time clustering as a scalable, impactful approach for Timepix4-scale data throughput.

Abstract

The latest generation of Timepix series hybrid pixel detectors enhances particle tracking with high spatial and temporal resolution. However, their high hit-rate capability poses challenges for data processing, particularly in multidetector configurations or systems like Timepix4. Storing and processing each hit offline is inefficient for such high data throughput. To efficiently group partly unsorted pixel hits into clusters for particle event characterization, we explore parallel approaches for online clustering to enable real-time data reduction. Although using multiple CPU cores improved throughput, scaling linearly with the number of cores, load-balancing issues between processing and I/O led to occasional data loss. We propose a parallel connected component labeling algorithm using a union-find structure with path compression optimized for zero-suppression data encoding. Our GPU implementation achieved a throughput of up to 300 million hits per second, providing a two-order-of-magnitude speedup over the CPU-based methods it was compared against, while also freeing CPU resources for I/O handling and reducing data loss.
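To make the union-find idea concrete, the following is a minimal sequential sketch of connected component labeling over zero-suppressed hits, not the authors' parallel implementation. It assumes that hits arrive as `(x, y, toa)` tuples already sorted by time of arrival, that two hits belong together when they are 8-neighbours within a hypothetical `window_ns` time window, and that each hit's label is the buffer index of its tree root. The names `find`, `union`, `label_hits`, and `window_ns` are illustrative only.

```python
from typing import List, Tuple

# (x, y, time of arrival in ns); this record layout is an assumption.
Hit = Tuple[int, int, float]


def find(parent: List[int], i: int) -> int:
    """Return the root label of hit i and compress the path to it."""
    root = i
    while parent[root] != root:
        root = parent[root]
    # Second pass: point every node on the path directly at the root.
    while parent[i] != root:
        nxt = parent[i]
        parent[i] = root
        i = nxt
    return root


def union(parent: List[int], a: int, b: int) -> None:
    """Join the trees of hits a and b; the smaller root index wins."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def label_hits(hits: List[Hit], window_ns: float = 100.0) -> List[int]:
    """Label time-sorted hits; equal labels mean the same cluster.

    Only non-zero hits are stored, so no pixel matrix is allocated.
    window_ns is a hypothetical coincidence window, not a value from the paper.
    """
    parent = list(range(len(hits)))  # every hit starts as its own root
    start = 0  # oldest hit still inside the time window
    for i, (x, y, t) in enumerate(hits):
        while hits[start][2] < t - window_ns:
            start += 1
        for j in range(start, i):
            xj, yj, _ = hits[j]
            if abs(x - xj) <= 1 and abs(y - yj) <= 1:  # 8-neighbourhood
                union(parent, i, j)
    return [find(parent, i) for i in range(len(hits))]


if __name__ == "__main__":
    demo = [(10, 10, 0.0), (11, 10, 5.0), (50, 50, 6.0),
            (12, 11, 8.0), (10, 10, 500.0)]
    print(label_hits(demo))  # [0, 0, 2, 0, 4]
```

In the demo, hits 0, 1, and 3 form one cluster through a chain of neighbours, hit 2 is spatially isolated, and hit 4 reuses pixel (10, 10) but falls outside the assumed time window, so it starts a new cluster.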

Paper Structure

This paper contains 21 sections, 5 figures, 3 algorithms.

Figures (5)

  • Figure 1: Clustering computation graph combining both step-based (see the section on step-based parallelization) and data-based (see the section on data-based parallelization) parallelizations with $n_\mathrm{datalines}=4$ data lanes. Arrows denote the data flow, and each box represents an independent worker thread.
  • Figure 2: Three different views of the hit data. The pixel matrix representation shows hits as pixels labeled by their index in the time-sorted buffer. The same hits can be represented as trees, which shows how two trees are joined. Such a tree can be trivially implemented in a static array using tree parent pointers (labels).
  • Figure 3: The benchmarking dataset cluster sizes in pixels (a) and the cluster examples from the particular datasets, the pixel color indicating the ToT (b). Pion and lead data were acquired during measurements at SPS CERN beam lines. The last three datasets were artificially created by selecting a subset of large clusters from the lead dataset.
  • Figure 4: Dependence of parallel CPU clustering throughput on the number of data lanes (pipelines) in the computation graph (Figure 1). One can see the throughput scaling with an increasing degree of parallelization. The difference in scaling among different CPUs is probably caused mainly by their number of cores and clock frequency.
  • Figure 5: Dependence of parallel GPU clustering throughput on the number of launched cores. One of the tests was run on an Nvidia RTX 4070 Ti Super 16 GB with PCIe 4.0 $\times$16 with 8848 cores. The other test was run on an Nvidia RTX 3060 Laptop 6 GB with PCIe 4.0 $\times$4 with 3840 cores. The buffer size was set so that each thread, except the last one, processed a chunk with at least 10,000 hits.
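
The chunking rule in the Figure 5 caption can be illustrated with a short sketch. This is one plausible reading of that rule, not the authors' code: the buffer is cut into fixed-size contiguous ranges, one per thread, and only the last range may be shorter. The function name `chunk_bounds` and its parameters are hypothetical.

```python
from typing import List, Tuple


def chunk_bounds(n_hits: int, chunk_size: int = 10_000) -> List[Tuple[int, int]]:
    """Split the buffer [0, n_hits) into half-open ranges, one per thread.

    Every range holds exactly chunk_size hits except possibly the last,
    which takes whatever remains. chunk_size defaults to the 10,000 hits
    quoted in the caption; the splitting scheme itself is an assumption.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(s, min(s + chunk_size, n_hits))
            for s in range(0, n_hits, chunk_size)]


if __name__ == "__main__":
    print(chunk_bounds(25_000))  # [(0, 10000), (10000, 20000), (20000, 25000)]
```

Under this scheme the number of threads follows from the buffer size. Clusters that straddle a chunk boundary would still need a separate merge step, which is not shown here.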