Syndeo: Portable Ray Clusters with Secure Containerization

William Li, Rodney S. Lafuente Mercado, Jaime D. Pena, Ross E. Allen

TL;DR

Syndeo tackles the incompatibility between Slurm and Ray by embedding a Ray cluster inside Slurm-managed resources and containerizing the entire stack with Apptainer/Singularity. This approach achieves cross-architecture portability: the same containers can be deployed on on-premises Slurm clusters or on cloud environments via Kubernetes without rewriting code. The framework demonstrates scalable, secure execution of Ray workloads in multi-tenant HPC settings, where unprivileged user profiles and container isolation improve security. In practice, Syndeo lets researchers run modern AI workflows across diverse infrastructures with near-linear throughput scaling, without rewriting code for a particular scheduler or container runtime.

Abstract

We present Syndeo: a software framework for container orchestration of Ray on Slurm. In general, the idea behind Syndeo is to write code once and deploy anywhere. Specifically, Syndeo is designed to address the issues of portability, scalability, and security for parallel computing. The design is portable because the containerized Ray code can be re-deployed on Amazon Web Services, Microsoft Azure, Google Cloud, or Alibaba Cloud. The process is scalable because we optimize for multi-node, high-throughput computing. The process is secure because users are forced to operate with unprivileged profiles, meaning administrators control the access permissions. We demonstrate Syndeo's portable, scalable, and secure design by deploying containerized parallel workflows on Slurm, which Ray does not officially support.

Paper Structure (16 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: In the Slurm paradigm, jobs are sent to homogeneous worker nodes together with the data they need to process. When the worker nodes finish processing their jobs, synchronization points are used to aggregate the data. Ray runs jobs under a different paradigm. In Ray, each job is an abstraction that cannot start until all of its dependencies are met. Dependencies can be computing resources on heterogeneous worker nodes or data from the Global Object Store. Jobs can push artifacts to the Global Object Store, and those artifacts may in turn be dependencies for other jobs. Note that this is a simplified description of Slurm and Ray; Slurm offers multiple plugins that can change how it processes jobs.
  • Figure 2: Syndeo starts container orchestration by providing a copy of the container to all allocated nodes. One head node will be assigned and the rest will be worker nodes. Each container is preconfigured with Ray and the user's algorithm(s). At runtime, Syndeo initializes Ray on all containers and checks for network connectivity. If all the Ray containers successfully connect, they form a Ray Cluster. The Ray Cluster allows users to submit jobs and will execute them on the Ray scheduler. Syndeo offers a simple container orchestration method that is compatible with Slurm.
  • Figure 3: All environments tested with their corresponding CPU configuration and throughput values.
  • Figure 4: Ideal versus measured performance with $2\sigma$ standard deviation bands. Here we assume that the ideal curve is an extrapolation of the 28 CPU worker throughput. As more CPU workers are added, communication costs increase, which degrades performance.
  • Figure 5: Ideal versus measured performance with $2\sigma$ standard deviation bands. Here we assume that the ideal curve is an extrapolation of the 28 CPU worker throughput. As more CPU workers are added, communication costs increase, which degrades performance.
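
The ideal-versus-measured comparison described in the Figure 4 and Figure 5 captions can be sketched roughly as follows. This is a minimal illustration, not the authors' analysis code: it assumes the extrapolation from the 28-CPU baseline is linear in the number of CPU workers and that the $2\sigma$ band is built from the sample standard deviation of repeated runs. The function names (`ideal_throughput`, `scaling_efficiency`, `two_sigma_band`) are hypothetical, and any numbers passed in are placeholders rather than results from the paper.

```python
import statistics

# Baseline worker count named in the Figure 4/5 captions.
BASELINE_CPUS = 28


def ideal_throughput(baseline_throughput, cpus, baseline_cpus=BASELINE_CPUS):
    """Extrapolate the baseline throughput to `cpus` workers.

    Assumption: the extrapolation is linear in the CPU count.
    """
    if cpus <= 0 or baseline_cpus <= 0:
        raise ValueError("CPU counts must be positive")
    return baseline_throughput * cpus / baseline_cpus


def scaling_efficiency(measured, baseline_throughput, cpus,
                       baseline_cpus=BASELINE_CPUS):
    """Ratio of measured to ideal throughput (1.0 means perfect scaling)."""
    return measured / ideal_throughput(baseline_throughput, cpus, baseline_cpus)


def two_sigma_band(samples):
    """Return (mean - 2*sigma, mean + 2*sigma) for repeated measurements.

    Assumption: sigma is the sample standard deviation; needs >= 2 samples.
    """
    mean = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    return mean - 2 * sigma, mean + 2 * sigma
```

An efficiency below 1.0 at higher CPU counts would correspond to the degradation the captions attribute to growing communication costs.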