A high-performance and portable implementation of the SISSO method for CPUs and GPUs

Sebastian Eibl, Yi Yao, Matthias Scheffler, Markus Rampp, Luca M. Ghiringhelli, Thomas A. R. Purcell

TL;DR

The paper ports the SISSO++ framework to GPUs using Kokkos, giving a single, performance-portable codebase for Nvidia and AMD devices while retaining MPI+OpenMP parallelism. It targets the three bottlenecks of SISSO (feature generation, SIS screening, and $\ell_0$-regularization) through fused GPU kernels, batched solvers, and auto-tuning with mixed-precision options. Benchmarks on Nvidia A100, AMD MI250, and AMD MI300A show single-node speedups of up to ~6x, and the code also exhibits strong scaling across multiple nodes. The work supports larger, ensemble-based symbolic regression workflows for active learning, and the code is open source under Apache 2.0.

Abstract

SISSO (sure-independence screening and sparsifying operator) is an artificial intelligence (AI) method based on symbolic regression and compressed sensing that is widely used in materials science research. SISSO++ is its C++ implementation, which employs MPI and OpenMP for parallelization and is therefore well suited to high-performance computing (HPC) environments. As heterogeneous hardware becomes mainstream in the HPC and AI fields, we chose to port the SISSO++ code to GPUs using the Kokkos performance-portability library. Kokkos allows us to maintain a single codebase for both Nvidia and AMD GPUs, significantly reducing the maintenance effort. In this work, we summarize the code changes we made to achieve hardware and performance portability. This is accompanied by performance benchmarks on Nvidia and AMD GPUs. We demonstrate the speedups obtained from using GPUs across the three most time-consuming parts of our code.
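
The three code parts named above (feature creation, SIS, and $\ell_0$-regularization) can be illustrated with a toy sketch. The following pure-Python example is not taken from SISSO++ and does not use its API; all function names (`create_features`, `sis`, `l0_search`, `fit_rmse`, `solve`) and the data are hypothetical. It assumes a single rung with only the multiplication operator, the absolute Pearson correlation as the SIS score, and an exhaustive least-squares search for the $\ell_0$ step. The residual-based SIS iterations used for higher-dimensional models are omitted.

```python
import itertools
import math


def pearson(u, v):
    """Pearson correlation of two equally long sequences (0.0 if one is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def create_features(primary):
    """Feature creation (one rung): add the product of every pair of primary features."""
    feats = dict(primary)
    for (na, a), (nb, b) in itertools.combinations(primary.items(), 2):
        feats[f"({na}*{nb})"] = [x * y for x, y in zip(a, b)]
    return feats


def sis(feats, target, n_keep):
    """SIS: keep the n_keep features with the largest |correlation| to the target."""
    ranked = sorted(feats, key=lambda k: abs(pearson(feats[k], target)), reverse=True)
    return ranked[:n_keep]


def solve(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_rmse(columns, target):
    """Least-squares fit target ~ c0 + sum_i c_i * column_i; returns (coefficients, RMSE)."""
    basis = [[1.0] * len(target)] + [list(c) for c in columns]
    gram = [[sum(p * q for p, q in zip(r, s)) for s in basis] for r in basis]
    rhs = [sum(p * t for p, t in zip(r, target)) for r in basis]
    coef = solve(gram, rhs)
    pred = [sum(c * r[i] for c, r in zip(coef, basis)) for i in range(len(target))]
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))
    return coef, rmse


def l0_search(feats, names, target, dim):
    """l0 step: try every dim-tuple of screened features, keep the lowest-RMSE model."""
    best = min(itertools.combinations(names, dim),
               key=lambda combo: fit_rmse([feats[k] for k in combo], target)[1])
    return best, fit_rmse([feats[k] for k in best], target)


# Hypothetical data: the target is exactly 2 * x1 * x2 + 1.
primary = {"x1": [1, 2, 3, 4, 5], "x2": [2, 1, 4, 3, 6], "x3": [5, 3, 1, 4, 2]}
target = [2 * a * b + 1 for a, b in zip(primary["x1"], primary["x2"])]

feats = create_features(primary)                 # 3 primary + 3 product features
screened = sis(feats, target, n_keep=3)          # reduce to a small subspace
best, (coef, rmse) = l0_search(feats, screened, target, dim=1)
print(best, coef, rmse)
```

On this toy data the search selects the product feature `(x1*x2)` with a near-zero error. The nested loops over feature pairs and over candidate tuples are the kind of work that grows combinatorially with the problem size, which is why these three parts dominate the run time.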

Paper Structure

This paper contains 14 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of the SISSO method. a) An example of the feature creation step of SISSO with two primary features $x_1$ (light orange) and $x_2$ (purple) and the multiplication operator (dark green). b) A flowchart of the algorithm used for the descriptor identification step. c) A comparison of the maximum number of features generated for the multiplication operator with respect to the number of primary features for rungs 1 (red triangle), 2 (blue plus), 3 (gray star), 4 (lavender hexagon), and 5 (green x). d) A comparison of the number of models evaluated against the size of the SIS subspace for 1- (orange circle), 2- (brown square), 3- (blue diamond), 4- (purple triangle), and 5-dimensional (green pentagon) models.
  • Figure 2: CPU and GPU algorithms for feature generation
  • Figure 3: A comparison of the total run time (left column, panels a, c) with a breakdown into the major algorithmic parts (right column, panels b, d) for the thermal conductivity (top row, panels a, b) and Kaggle competition (bottom row, panels c, d) test cases. The benchmark was executed on the Nvidia A100 platform (Raven). The gray bars show CPU-only calculations (MKL-based baseline), the green bars show GPU-enabled calculations in double precision, the brown bars show a hybrid CPU-GPU setup for $\ell_0$ regularization, and the "x" hatched bars show the corresponding results in single precision (FP32). Bars involving GPU calculations are labelled with the speedup relative to the CPU-only reference configuration, defined as the ratio $t_\mathrm{CPU} / t_\mathrm{GPU}$. The hybrid setup is not shown in panels a and b because those calculations currently cannot be run efficiently with more than one thread per MPI rank.
  • Figure 4: Runtime as a function of the number of compute nodes on the A100 platform (Raven) for the thermal conductivity test case, executed with the a) pre-optimized version, b) CPU-only, and c) GPU-accelerated code. The total runtime (maroon squares) is shown together with a breakdown into the code parts FC (gray diamonds), SIS (gold diamonds), and $\ell_0$-regularization (purple circles). Ideal strong scaling is indicated by dashed lines.
  • Figure 5: A comparison of the total run time (left column, panels a, c) with a breakdown into the major algorithmic parts (right column, panels b, d) for the thermal conductivity (top row, panels a, b) and $\ell_0$ (bottom row, panels c, d) benchmarks, executed on a single node with different hardware setups. The green, blue, and red bars show calculation times on Nvidia A100, AMD MI250, and AMD MI300A GPUs, respectively, and the "x" hatched bars show single-precision (FP32) results for the same platforms. The A100 and MI250 nodes have 4 GPUs each, and the MI300A node has 2 GPUs.
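
The growth illustrated in Figure 1c and 1d can be estimated with simple counting. The sketch below is a rough illustration and not the paper's own formula. It assumes that each rung multiplies every unordered pair of distinct existing features once, with no duplicate or invalid features removed, and that the $\ell_0$ step evaluates every $D$-tuple drawn from a candidate pool of a given size. The function names are hypothetical.

```python
from math import comb


def max_features_after_rungs(n_primary, n_rungs):
    """Upper bound on the feature count when each rung adds all pairwise products.

    Assumption: every unordered pair of distinct existing features is multiplied
    once per rung and nothing is pruned, so this overestimates real feature spaces.
    """
    n = n_primary
    for _ in range(n_rungs):
        n += comb(n, 2)  # new products formed from the current feature set
    return n


def n_l0_models(n_candidates, dim):
    """Number of dim-dimensional models in an exhaustive search over n_candidates features."""
    return comb(n_candidates, dim)


for rung in range(1, 4):
    print("rung", rung, "features:", max_features_after_rungs(2, rung))

for dim in range(1, 4):
    print("dimension", dim, "models:", n_l0_models(100, dim))
```

Under these assumptions, two primary features grow to 3, 6, and 21 features after one, two, and three rungs, and a pool of 100 candidates yields 100, 4950, and 161700 models for one, two, and three dimensions. Both counts grow faster than linearly, which is consistent with feature creation and $\ell_0$-regularization being among the most time-consuming parts of the code.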