Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

Ali Doosthosseini; Jonathan Decker; Hendrik Nolte; Julian M. Kunkel

TL;DR

This work tackles the challenge of providing private, real-time LLM serving on Slurm-based HPC clusters, where traditional batch scheduling is ill-suited to interactive web services. The authors introduce Chat AI, a Slurm-native architecture that links a cloud-hosted web frontend to an HPC-hosted LLM backend. The link consists of a proxy protected by an SSH ForceCommand-based circuit breaker, a scheduler, and an OpenAI-compatible vLLM runtime, all behind a Kong API gateway. Key contributions include a production-ready, privacy-focused design that avoids storing user prompts server-side, a defense-in-depth security model, and demonstrable adoption across hundreds of institutions. Together, these show that private LLM inference on existing HPC infrastructure is practical at scale for academic and research contexts, giving users rapid, secure access to open-source models while maintaining data sovereignty.

Abstract

The widespread adoption of large language models (LLMs) has created a pressing need for efficient, secure, and private serving infrastructure that allows researchers to run open-source or custom fine-tuned LLMs and assures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs are well suited for training LLMs, their batch scheduling paradigm is not designed to support real-time serving of AI applications. Cloud systems, on the other hand, are well suited for web services but commonly lack access to the computational power of HPC clusters, especially the expensive and scarce high-end GPUs that are required for optimal inference speed. We propose an architecture, together with an implementation, consisting of a web service that runs on a cloud VM with secure access to a scalable backend running multiple LLMs on HPC systems. By offering a web service that uses our HPC infrastructure to host LLMs, we leverage the trusted environment of local universities and research centers to offer a private and secure alternative to commercial LLM services. Our solution natively integrates with the HPC batch scheduler Slurm, enabling seamless deployment on HPC clusters, and is able to run side by side with regular Slurm workloads while utilizing gaps in the schedule created by Slurm. To ensure the security of the HPC system, we use the SSH ForceCommand directive to construct a robust circuit breaker, which prevents successful attacks on the web-facing server from affecting the cluster. We have successfully deployed our system as a production service and made the source code available at https://github.com/gwdg/chat-ai.
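The circuit-breaker idea mentioned in the abstract can be illustrated with a minimal sketch. When `ForceCommand` is set in `sshd_config`, sshd runs the configured program instead of whatever the client requested, and exposes the client's request in the `SSH_ORIGINAL_COMMAND` environment variable. The sketch below assumes a forced Python script that accepts only a fixed request format; the action names, the `<action> <model>` format, and the functions `parse_request` and `handle` are hypothetical and are not taken from the Chat AI source code.

```python
import shlex
import sys
from typing import Mapping, Tuple

# Hypothetical allow-list: the only actions the web-facing server may request.
ALLOWED_ACTIONS = {"infer", "health"}


def parse_request(original_command: str) -> Tuple[str, str]:
    """Validate a requested command and return (action, model).

    Raises ValueError for anything outside the allow-list, so an
    arbitrary shell command sent by the client is never executed.
    """
    try:
        parts = shlex.split(original_command)
    except ValueError as exc:  # e.g. unbalanced quotes
        raise ValueError("malformed request") from exc
    if len(parts) != 2:
        raise ValueError("expected exactly: <action> <model>")
    action, model = parts
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not permitted: {action!r}")
    # Restrict model names to a conservative ASCII character set.
    if not model or not all(
        c.isascii() and (c.isalnum() or c in "-_.") for c in model
    ):
        raise ValueError("invalid model name")
    return action, model


def handle(environ: Mapping[str, str]) -> int:
    """Return a process exit code: 0 if the request is accepted, 1 otherwise."""
    # sshd stores the client's requested command in this variable
    # whenever a forced command is in effect.
    original = environ.get("SSH_ORIGINAL_COMMAND", "")
    try:
        action, model = parse_request(original)
    except ValueError as exc:
        print(f"rejected: {exc}", file=sys.stderr)
        return 1
    # A real deployment would forward the request to the backend here.
    print(f"accepted: {action} {model}")
    return 0
```

A forced script would call `handle(os.environ)` and exit with the returned code. The point of the design is that even if the web-facing server is compromised, the SSH key it holds can trigger only the validated actions and cannot open a shell on the cluster.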

Paper Structure (47 sections, 5 figures, 2 tables)

Introduction
Background
Slurm
Scheduling Paradigm
vLLM
Kong
Security Aspect
Related Work
Challenges
Paradigm Differences
Security
Performance
Scalability
Reliability
Privacy
...and 32 more sections

Figures (5)

Figure 1: Architecture of Chat AI. This diagram displays the main components of the service, consisting of an ESX web server that communicates with the login/service node, and the compute nodes of the HPC KISSKI platform.
Figure 2: Chat AI App. This shows the Chat AI web interface, written with React and Vite, with the chat history on the left, the prompt window on the top right, and a dropdown for model selection at the bottom right.
Figure 3: Total number of distinct users from February 22nd until July 30th 2024. The total number of users has grown consistently since the initial release, with a slight jump following a university-wide advertisement on April 8th.
Figure 4: Daily Chat AI users from February 22nd until July 30th 2024. New users are those who used the Chat AI service for the first time on a given day, while Daily users are returning users.
Figure 5: Total inference requests per day from February 22nd until July 30th 2024. This shows the growth in popularity of Chat AI via the number of requests per day, as well as when significant models were added to the service. Each bar also shows the number of requests handled by internal models, i.e. those hosted on our own infrastructure, as opposed to external models, namely OpenAI's GPT-3.5 and GPT-4.
