Automated Dynamic AI Inference Scaling on HPC-Infrastructure: Integrating Kubernetes, Slurm and vLLM

Tim Trappen; Robert Keßler; Roland Pabel; Viktor Achter; Stefan Wesner

TL;DR

This paper tackles the mismatch between static HPC resource models and the dynamic, user-facing demands of LLM inference in higher education. It introduces a two-layer architecture that couples a Kubernetes-managed web API with Slurm-backed vLLM endpoints, running on Apptainer within RAMSES, and coordinated via a central PostgreSQL database and an observability stack (Prometheus, Grafana, Loki). Key contributions include the integration of vLLM with Slurm for multi-GPU, cross-node inference, a modular web gateway for OpenAI-compatible requests, and an automated dynamic-scaling loop driven by GPU load and Grafana alerts. Preliminary benchmarks demonstrate the approach can handle 100, 500, and 1000 concurrent requests with roughly 500 ms end-to-end latency overhead, illustrating the viability of sovereign, on-prem AI inference in an HPC setting and informing future work on caching, networking, and broader modality support.

Abstract

Due to rising demand for Artificial Intelligence (AI) inference, especially in higher education, novel solutions utilising existing infrastructure are emerging. The utilisation of High-Performance Computing (HPC) has become a prevalent approach for the implementation of such solutions. However, the classical operating model of HPC does not adapt well to the requirements of synchronous, user-facing, dynamic AI application workloads. In this paper, we propose a solution that serves LLMs by integrating vLLM, Slurm and Kubernetes on the supercomputer RAMSES. The initial benchmark indicates that the proposed architecture scales efficiently for 100, 500 and 1000 concurrent requests, adding only approximately 500 ms of overhead to end-to-end latency.
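To illustrate the kind of decision a load-driven scaling loop has to make, the following is a minimal sketch, not the authors' implementation. It assumes that GPU utilisation values (fractions between 0 and 1) have already been collected from the monitoring stack, and that each "replica" stands for one inference endpoint. The names `ScalingPolicy` and `desired_replicas`, as well as all thresholds and limits, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All values below are illustrative assumptions, not figures from the paper.
    scale_up_util: float = 0.85    # mean GPU utilisation that triggers a scale-up
    scale_down_util: float = 0.30  # mean GPU utilisation that triggers a scale-down
    min_replicas: int = 1          # never drop below this many endpoints
    max_replicas: int = 4          # never exceed this many endpoints


def desired_replicas(
    current: int,
    gpu_utils: list[float],
    policy: ScalingPolicy = ScalingPolicy(),
) -> int:
    """Return the target number of inference endpoints for the next cycle.

    `gpu_utils` holds one utilisation sample per GPU in the range 0.0 to 1.0.
    The function changes the replica count by at most one step per call.
    """
    if gpu_utils:
        mean_util = sum(gpu_utils) / len(gpu_utils)
        if mean_util >= policy.scale_up_util:
            target = current + 1
        elif mean_util <= policy.scale_down_util:
            target = current - 1
        else:
            target = current
    else:
        # No samples available: keep the current count rather than guess.
        target = current

    # Clamp the result to the configured bounds.
    return max(policy.min_replicas, min(target, policy.max_replicas))
```

Changing the count by a single step per cycle is one simple way to avoid oscillation when starting a new endpoint is slow. How the target count is then turned into job submissions or cancellations is outside the scope of this sketch.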
