Heimdall++: Optimizing GPU Utilization and Pipeline Parallelism for Efficient Single-Pulse Detection

Bingzheng Xia, Zujie Ren, Kuang Ma, Xiaoqian Li, Wenda Li, Shuibing He

TL;DR

This work addresses real-time single-pulse detection in radio astronomy, where the Heimdall pipeline suffers from GPU stalls and underutilization caused by serial per-DM processing and heavy host-device data transfers. It introduces Heimdall++, an end-to-end redesign that adds fine-grained parallelism across DM trials, uses a shared device-memory allocator, and adopts CUDA Unified Memory to reduce host-device traffic. A two-stage multi-file pipeline additionally overlaps CPU I/O with GPU compute. The authors report up to 2.66× speedup on a 1 GB file and 2.05× in batch processing on an RTX 3080 Ti, with search results consistent with those of the original Heimdall. The work offers a blueprint for real-time, high-throughput single-pulse searches in next-generation radio astronomy surveys.

Abstract

With the increasing time and frequency resolution of modern radio telescopes and the exponential growth in observational data volumes, real-time single-pulse detection has become a critical requirement for time-domain radio astronomy. Heimdall, as a representative GPU-accelerated single-pulse search tool, offers substantial performance advantages over CPU-based approaches. However, its sequential execution model and resource contention in intermediate processing stages limit GPU utilization, leading to suboptimal throughput and increased computational latency. To address these limitations, we present Heimdall++, an optimized successor to Heimdall that incorporates fine-grained GPU parallelization, enhanced memory management, and a multi-threaded framework to decouple CPU-bound and GPU-bound processing stages. This design mitigates the GPU stall problem and improves end-to-end efficiency. We evaluated Heimdall++ on a system equipped with NVIDIA RTX 3080 Ti GPUs using both a single large-scale observational file and multiple files. Experimental results demonstrate that Heimdall++ achieves up to 2.66x speedup in single-file processing and 2.05x speedup in multi-file batch processing, while maintaining full consistency with the original Heimdall's search results.
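The decoupling of CPU-bound and GPU-bound stages described above can be pictured as a bounded producer-consumer pipeline. The following is a minimal sketch, not the Heimdall++ implementation (which targets CUDA): `run_pipeline`, `load`, `search`, and `depth` are hypothetical names chosen for illustration, and the `search` callable merely stands in for the GPU stage.

```python
import queue
import threading


def run_pipeline(paths, load, search, depth=2):
    """Overlap a CPU-bound load stage with a stand-in for the GPU search stage.

    `load` and `search` are caller-supplied placeholders (assumptions),
    not functions from Heimdall or Heimdall++.
    """
    # Bounded queue: the reader can run at most `depth` files ahead,
    # which caps the number of loaded files held in memory.
    buffers = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the input
    results = []

    def producer():
        try:
            for path in paths:
                buffers.put((path, load(path)))  # stage 1: file I/O on the CPU
        finally:
            buffers.put(done)

    # Daemon thread so a failure in stage 2 cannot leave the process hanging.
    reader = threading.Thread(target=producer, daemon=True)
    reader.start()

    while True:
        item = buffers.get()
        if item is done:
            break
        path, data = item
        results.append((path, search(data)))  # stage 2: compute on loaded data

    reader.join()
    return results
```

While stage 2 works on one file, the reader thread is already loading the next, so the compute stage does not have to wait for I/O between files. The queue depth trades memory for tolerance to uneven load times.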

Paper Structure

This paper contains 18 sections, 1 equation, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Computational workflow of the original Heimdall pipeline, highlighting its sequential execution within the dispersion measure trial loop.
  • Figure 2: Stage-wise processing time in Heimdall. The yellow segment corresponds to the total runtime of the DM trials loop.
  • Figure 3: GPU utilization of Heimdall for processing a 1 GB input file.
  • Figure 4: Heimdall++ multi-stream parallelization architecture. Here, $S_1, S_2, \dots, S_n$ denote individual CUDA streams, each assigned a subset of DM trials. The DM trial loop is distributed across multiple CPU threads and CUDA streams to enable concurrent execution of Baseline Removal, Normalization, Matched Filtering, and Peak Detection.
  • Figure 5: Pipeline-parallel framework of Heimdall++.
  • ...and 7 more figures
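
The Figure 4 caption describes assigning each stream a subset of DM trials. A minimal sketch of that idea follows, under stated assumptions: the round-robin split is an illustrative choice (the caption does not say how subsets are formed), Python threads stand in for CPU threads that would each drive one CUDA stream, and `partition_trials`, `process_trial`, and `search_all_trials` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_trials(n_trials, n_streams):
    """Split DM trial indices round-robin across streams (assumed scheme)."""
    return [list(range(s, n_trials, n_streams)) for s in range(n_streams)]


def process_trial(dm_index):
    # Placeholder for the per-trial stages named in the caption: baseline
    # removal, normalization, matched filtering, and peak detection.
    return dm_index, f"candidates-for-dm-{dm_index}"


def search_all_trials(n_trials, n_streams):
    """Process every DM trial, one worker thread per simulated stream."""
    subsets = partition_trials(n_trials, n_streams)

    def run_stream(subset):
        # Trials within one stream run in order; streams run concurrently.
        return [process_trial(i) for i in subset]

    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        per_stream = list(pool.map(run_stream, subsets))

    # Merge per-stream results back into DM-trial order.
    return sorted(r for chunk in per_stream for r in chunk)
```

Merging the results back into trial order keeps the output independent of how many streams were used, which matters when the parallel version is expected to produce the same results as a serial loop.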