Code Generation for a Variety of Accelerators for a Graph DSL

Ashwina Kumar, M. Venkata Krishna, Prasanna Bartakke, Rahul Kumar, Rajesh Pandian M, Nibedita Behera, Rupesh Nasre

TL;DR

The paper tackles scalable graph analytics by enabling a graph DSL, StarPlat, to auto-generate parallel code for multiple accelerators from a single specification. It relies on a CSR-based graph representation, vertex-centric iteration, and constructs such as fixedPoint and Min/Max to express iterative graph algorithms, with backend-specific optimizations for CUDA, SYCL, OpenCL, and OpenACC. The authors evaluate StarPlat on ten large graphs across four algorithms: betweenness centrality (BC), PageRank (PR), single-source shortest paths (SSSP), and triangle counting (TC), comparing against Gunrock and LonestarGPU. The findings indicate that multi-backend code generation is feasible and practical for domain experts: it offers portability and competitive performance without hand-tuning for each platform, though backend maturity and graph type influence relative performance.
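To make the constructs named above concrete, the following is a minimal sequential C++ sketch of SSSP over a CSR graph, written as a vertex-centric fixed-point loop with a Min-style update. It is an illustration only, not StarPlat's DSL syntax or its generated code: the names `Edge`, `CsrGraph`, `build_csr`, and `sssp` are hypothetical, edge weights are assumed to be non-negative, and the loop over vertices stands in for what a GPU backend would run as one thread per vertex with an atomic minimum.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical edge record used only to build the example graph.
struct Edge {
    int src;
    int dst;
    int weight;
};

// Compressed Sparse Row (CSR) storage for a directed, weighted graph.
struct CsrGraph {
    std::vector<int> offsets;  // out-edges of v occupy [offsets[v], offsets[v + 1])
    std::vector<int> targets;  // destination vertex of each edge
    std::vector<int> weights;  // weight of each edge
    int num_vertices() const { return static_cast<int>(offsets.size()) - 1; }
};

// Build CSR arrays from an edge list with a counting sort on the source vertex.
CsrGraph build_csr(int n, const std::vector<Edge>& edges) {
    CsrGraph g;
    g.offsets.assign(n + 1, 0);
    for (const Edge& e : edges) g.offsets[e.src + 1]++;
    for (int v = 0; v < n; ++v) g.offsets[v + 1] += g.offsets[v];
    g.targets.resize(edges.size());
    g.weights.resize(edges.size());
    std::vector<int> cursor(g.offsets.begin(), g.offsets.end() - 1);
    for (const Edge& e : edges) {
        int slot = cursor[e.src]++;
        g.targets[slot] = e.dst;
        g.weights[slot] = e.weight;
    }
    return g;
}

constexpr int kInf = std::numeric_limits<int>::max();

// SSSP as a fixed-point computation: repeat until no distance changes.
std::vector<int> sssp(const CsrGraph& g, int source) {
    const int n = g.num_vertices();
    std::vector<int> dist(n, kInf);
    std::vector<char> modified(n, 0);
    dist[source] = 0;
    modified[source] = 1;
    bool changed = true;
    while (changed) {  // plays the role of a fixedPoint construct
        changed = false;
        std::vector<char> modified_next(n, 0);
        // Vertex-centric step: sequential here, one thread per vertex on a GPU.
        for (int v = 0; v < n; ++v) {
            if (!modified[v]) continue;  // only vertices updated last round do work
            for (int e = g.offsets[v]; e < g.offsets[v + 1]; ++e) {
                int nbr = g.targets[e];
                int candidate = dist[v] + g.weights[e];
                // Min-style update; a parallel backend would need an atomic min.
                if (candidate < dist[nbr]) {
                    dist[nbr] = candidate;
                    modified_next[nbr] = 1;
                    changed = true;
                }
            }
        }
        modified.swap(modified_next);
    }
    return dist;  // unreachable vertices keep the value kInf
}
```

For example, on the four-vertex graph with edges 0→1 (weight 2), 0→2 (weight 5), 1→2 (weight 1), and 2→3 (weight 3), `build_csr` produces offsets `{0, 2, 3, 4, 4}` and `sssp(g, 0)` returns the distances `{0, 2, 3, 6}`. The per-vertex `modified` flags are one common way to limit work to recently updated vertices; whether a given backend uses them is an assumption of this sketch.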

Abstract

Sparse graphs are ubiquitous in the real and virtual worlds. With the phenomenal growth in semi-structured and unstructured data, the underlying graphs have grown rapidly in size over the years. Analyzing such large structures necessitates parallel processing, which is challenged by the intrinsic irregularity of sparse computation, memory access, and communication. Ideally, programmers and domain experts would focus only on the sequential computation while a compiler takes care of auto-generating the parallel code. On the other hand, there is a wide variety of target hardware devices, and achieving optimal performance often demands coding in specific languages or frameworks. Our goal in this work is a graph DSL that allows domain experts to write almost-sequential code and generates parallel code for different accelerators from the same algorithmic specification. In particular, we illustrate code generation from the StarPlat graph DSL for NVIDIA, AMD, and Intel GPUs using the CUDA, OpenCL, SYCL, and OpenACC programming languages. Using a suite of ten large graphs and four popular algorithms, we demonstrate the efficacy of StarPlat's versatile code generator.

Paper Structure (20 sections, 4 tables)