Syndeo: Portable Ray Clusters with Secure Containerization
William Li, Rodney S. Lafuente Mercado, Jaime D. Pena, Ross E. Allen
TL;DR
Syndeo tackles the incompatibility between Slurm and Ray by embedding a Ray cluster inside Slurm-managed resources and containerizing the entire stack with Apptainer/Singularity. This approach makes workloads portable across architectures: the same code can be deployed on on-premises Slurm clusters or in cloud environments via Kubernetes without being rewritten. The framework demonstrates scalable, secure execution of Ray workloads in multi-tenant HPC settings, where unprivileged user profiles and container isolation improve security. In practice, Syndeo lets researchers run modern AI workflows across diverse infrastructures with near-linear throughput scaling, without rewriting code for a particular scheduler or container runtime.
Abstract
We present Syndeo: a software framework for container orchestration of Ray on Slurm. The general idea behind Syndeo is to write code once and deploy anywhere. Specifically, Syndeo is designed to address the issues of portability, scalability, and security for parallel computing. The design is portable because the containerized Ray code can be re-deployed on Amazon Web Services, Microsoft Azure, Google Cloud, or Alibaba Cloud. The process is scalable because we optimize for multi-node, high-throughput computing. The process is secure because users are forced to operate with unprivileged profiles, meaning that administrators control the access permissions. We demonstrate Syndeo's portable, scalable, and secure design by deploying containerized parallel workflows on Slurm, a scheduler that Ray does not officially support.
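To make the idea of running a containerized Ray cluster inside a Slurm allocation concrete, the following is a minimal sketch and not Syndeo's actual implementation. It only builds the command lines that would launch one Ray head and several Ray workers, each inside an Apptainer container on its own Slurm node. The function name `ray_on_slurm_commands`, the image name `ray.sif`, the node names, and the port are hypothetical placeholders; the sketch assumes that `srun`, `apptainer exec`, and `ray start` are available on the cluster.

```python
import shlex


def ray_on_slurm_commands(nodes, image="ray.sif", port=6379):
    """Build one srun command per node; the first node hosts the Ray head.

    Hypothetical helper for illustration: it returns argument lists and
    does not execute anything.
    """
    if not nodes:
        raise ValueError("need at least one node")
    head = nodes[0]
    commands = []
    for index, node in enumerate(nodes):
        if index == 0:
            # The head node starts the Ray cluster and listens on `port`.
            ray_cmd = ["ray", "start", "--head", f"--port={port}", "--block"]
        else:
            # Worker nodes join the cluster by connecting to the head.
            ray_cmd = ["ray", "start", f"--address={head}:{port}", "--block"]
        # Pin one task to the chosen node and run Ray inside the container.
        commands.append(
            ["srun", "--nodes=1", "--ntasks=1", f"--nodelist={node}",
             "apptainer", "exec", image, *ray_cmd]
        )
    return commands


if __name__ == "__main__":
    for cmd in ray_on_slurm_commands(["node01", "node02", "node03"]):
        print(shlex.join(cmd))
```

Because the container image carries the Ray installation and the user code, the same image could in principle be reused on a different cluster or cloud platform; only the launch commands around it would change.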
