Benchmarking Machine Learning Applications on Heterogeneous Architecture using Reframe

Christopher Rae, Joseph K. L. Lee, James Richings, Michele Weiland

TL;DR

The paper addresses the challenge of benchmarking ML workloads on heterogeneous HPC architectures by extending the Reframe framework with a Kubernetes backend. It implements and validates MLPerf Training and MLPerf HPC benchmarks (ResNet-50, DeepCam, CosmoFlow) across CPU, GPU, Graphcore IPU, and Cerebras CS-2 systems at EPCC. The contributions include the Kubernetes-backed Reframe integration, a practical workflow for cross-platform ML benchmarking, and a discussion of porting challenges and performance insights that highlight the filesystem and I/O as major factors in benchmark performance. The work enables repeatable, portable performance testing for HPC centers and provides open-source tooling to extend benchmarking to other accelerators and workloads.

Abstract

With the rapid increase in machine learning workloads on HPC systems, it is beneficial to run machine-learning-specific benchmarks regularly to monitor performance and identify issues. As part of the Edinburgh International Data Facility, EPCC currently hosts a wide range of machine learning accelerators, including Nvidia GPUs, the Graphcore Bow Pod64 and the Cerebras CS-2, which are managed via Kubernetes and Slurm. We extended the Reframe framework to support a Kubernetes scheduler backend and used Reframe to run machine learning benchmarks. We discuss the preliminary results collected and the challenges involved in integrating Reframe across multiple platforms and architectures.

Paper Structure (13 sections, 1 figure, 4 tables)