Lessons Learned Migrating CUDA to SYCL: A HEP Case Study with ROOT RDataFrame

Jolly Chen, Monica Dessole, Ana Lucia Varbanescu

TL;DR

The paper addresses enabling GPU-accelerated data analysis in ROOT/RDataFrame by migrating a core histogramming operation from CUDA to SYCL to support heterogeneous architectures. It documents the migration workflow, implementation choices (including SYCL reductions and USM vs buffers), and a comparative evaluation of AdaptiveCpp and DPC++ against native CUDA. Key findings show that SYCL reductions and multi-kernel fusion can improve performance, while data-transfer strategies have nuanced effects and JIT overhead varies by compiler. Overall, DPC++ comes closer to CUDA performance than AdaptiveCpp does, though native CUDA still outperforms both SYCL implementations for this specific use case. The work provides practical guidance for SYCL developers, demonstrates the portability benefits of SYCL, and outlines a path for extending GPU migration to more RDataFrame actions.

Abstract

The world's largest particle accelerator, located at CERN, produces petabytes of data that need to be analysed efficiently, to study the fundamental structures of our universe. ROOT is an open-source C++ data analysis framework, developed for this purpose. Its high-level data analysis interface, RDataFrame, currently only supports CPU parallelism. Given the increasing heterogeneity in computing facilities, it becomes crucial to efficiently support GPGPUs to take advantage of the available resources. SYCL allows for a single-source implementation, which enables support for different architectures. In this paper, we describe a CUDA implementation and the migration process to SYCL, focusing on a core high energy physics operation in RDataFrame -- histogramming. We detail the challenges that we faced when integrating SYCL into a large and complex code base. Furthermore, we perform an extensive comparative performance analysis of two SYCL compilers, AdaptiveCpp and DPC++, and the reference CUDA implementation. We highlight the performance bottlenecks that we encountered, and the methodology used to detect these. Based on our findings, we provide actionable insights for developers of SYCL applications.
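
To make the histogramming operation named in the abstract concrete, the following is a minimal CPU-only sketch of a fixed-width 1D histogram fill. It is not the paper's CUDA or SYCL kernel and not ROOT's API: `Histo1DSketch`, `FindBin`, and `Fill` are hypothetical names chosen for illustration. The only convention borrowed from ROOT is the bin layout, where index 0 holds underflow and index `nbins + 1` holds overflow.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration, not ROOT's implementation: a fixed-width 1D
// histogram with nbins regular bins, an underflow bin at index 0 and an
// overflow bin at index nbins + 1.
struct Histo1DSketch {
    std::size_t nbins;
    double xlow;
    double xhigh;
    std::vector<double> counts;

    Histo1DSketch(std::size_t n, double lo, double hi)
        : nbins(n), xlow(lo), xhigh(hi), counts(n + 2, 0.0) {}

    // Map a value to its bin index; the upper edge is exclusive.
    std::size_t FindBin(double x) const {
        if (x < xlow) return 0;              // underflow
        if (!(x < xhigh)) return nbins + 1;  // overflow (also catches NaN)
        std::size_t bin =
            1 + static_cast<std::size_t>(nbins * (x - xlow) / (xhigh - xlow));
        return bin > nbins ? nbins : bin;    // guard against rounding at the edge
    }

    // Add weight w to the bin containing x.
    void Fill(double x, double w = 1.0) { counts[FindBin(x)] += w; }
};
```

On a GPU, many threads or work-items would run the equivalent of `Fill` concurrently, so the increment of `counts` would need an atomic update or a reduction; that concurrency is the part this serial sketch leaves out.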

Paper Structure (16 sections, 11 figures)

Figures (11)

  • Figure 1: Example of a ROOT 1D histogram.
  • Figure 2: Processing of a histogram action in RDataFrame.
  • Figure 3: Processing of a histogram action in RDataFrame with a GPU. Red = memory transfers, green = GPU execution, and blue = CPU execution.
  • Figure 4: Total time spent on GPU activity in Histo1D with increasing number of elements reduced per work-item using SYCL 2020 reductions. Each run processes 1B events in total, with multiple reduction variables in a single SYCL kernel.
  • Figure 5: Total time spent on GPU activity in Histo1D with multiple reduction variables per SYCL kernel (multi) or a single reduction variable per kernel (single). Two elements are reduced per work-item.
  • ...and 6 more figures