Parallel CPU- and GPU-based connected component algorithms for event building for hybrid pixel detectors
Tomáš Čelko, František Mráz, Benedikt Bergmann, Petr Mánek
TL;DR
The paper tackles real-time clustering for high-rate Timepix detectors by implementing parallel CPU and GPU strategies for online event building. It presents a parallel connected component labeling approach based on a union-find data structure optimized for zero-suppression data encoding, achieving up to $3\times 10^{8}$ hits/s on GPUs, a speedup of two orders of magnitude over the compared CPU methods. The CPU solution combines step-based and data-based parallelization to mitigate I/O bottlenecks and cluster-merging challenges. The GPU solution avoids conventional frame-based matrix clustering: it processes only non-zero hits and relies on union-find techniques optimized for memory use and access patterns. Experimental results on diverse datasets show high clustering fidelity (IoU > $99.99\%$ in most cases) and demonstrate the practical viability of real-time clustering for high-rate hybrid pixel detectors, enabling significant data reduction and faster online analysis. Overall, the work supports real-time clustering as a scalable approach for Timepix4-scale data throughput.
Abstract
The latest generation of Timepix series hybrid pixel detectors enhances particle tracking with high spatial and temporal resolution. However, their high hit-rate capability poses challenges for data processing, particularly in multidetector configurations or systems like Timepix4. Storing and processing each hit offline is inefficient for such high data throughput. To efficiently group partly unsorted pixel hits into clusters for particle event characterization, we explore parallel approaches for online clustering to enable real-time data reduction. Although using multiple CPU cores improved throughput, scaling linearly with the number of cores, load-balancing issues between processing and I/O led to occasional data loss. We propose a parallel connected component labeling algorithm using a union-find structure with path compression optimized for zero-suppression data encoding. Our GPU implementation achieved a throughput of up to 300 million hits per second, providing a two-order-of-magnitude speedup over the compared CPU-based methods while also freeing CPU resources for I/O handling and reducing data loss.
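
To make the core idea concrete, the following is a minimal sequential sketch of connected component labeling over zero-suppressed hits using union-find with path compression. It is an illustration under stated assumptions, not the authors' parallel implementation: the hit layout `(x, y, toa)`, the 8-neighbourhood adjacency, the `time_window` coincidence criterion, and the names `find`, `union`, and `cluster_hits` are all hypothetical choices made for this example.

```python
from collections import defaultdict


def find(parent, i):
    # Locate the root, then point every visited node directly at it
    # (path compression).
    root = i
    while parent[root] != root:
        root = parent[root]
    while parent[i] != root:
        parent[i], i = root, parent[i]
    return root


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        # Attach the larger root index to the smaller one so that the
        # final label of a cluster is its lowest hit index.
        if ra < rb:
            parent[rb] = ra
        else:
            parent[ra] = rb


def cluster_hits(hits, time_window):
    """Label sparse hits given as (x, y, toa) tuples.

    Assumption for this sketch: two hits belong to the same cluster if
    their pixels are 8-neighbours (or identical) and their arrival times
    differ by at most `time_window`.
    """
    parent = list(range(len(hits)))

    # Index only the non-zero hits by pixel; no full frame matrix is built.
    by_pixel = defaultdict(list)
    for idx, (x, y, _) in enumerate(hits):
        by_pixel[(x, y)].append(idx)

    for idx, (x, y, toa) in enumerate(hits):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for other in by_pixel.get((x + dx, y + dy), ()):
                    # `other > idx` visits each pair of hits only once.
                    if other > idx and abs(hits[other][2] - toa) <= time_window:
                        union(parent, idx, other)

    return [find(parent, i) for i in range(len(hits))]


# Hypothetical example data: three adjacent coincident hits, one distant
# hit, and one adjacent hit that arrives much later.
example_hits = [
    (10, 10, 100),
    (11, 10, 105),
    (11, 11, 102),
    (50, 50, 103),
    (10, 11, 900),
]
example_labels = cluster_hits(example_hits, time_window=50)
```

Because the structure is indexed by hit rather than by pixel, memory and work scale with the number of non-zero hits instead of the sensor area, which is the property the zero-suppression encoding is meant to exploit. A parallel version would additionally need atomic or otherwise synchronized root updates, which this sketch does not attempt.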
