AIvailable: A Software-Defined Architecture for LLM-as-a-Service on Heterogeneous and Legacy GPUs

Pedro Antunes, Ana Rita Ortigoso, Gabriel Vieira, Daniel Fuentes, Luís Frazão, Nuno Costa, António Pereira

TL;DR

AIvailable addresses the challenge of deploying LLM inference on heterogeneous and legacy GPUs in resource-constrained environments by introducing a GPU-centric, software-defined LLM-as-a-Service platform. It articulates a four-component architecture—Client Interface, Service Frontend, SDAI Controller, and Service Backend—coupled with VRAM-aware, dynamic model placement to achieve high availability without CPU fallbacks. A working prototype demonstrates feasibility across mixed ROCm/CUDA hardware, employing OpenWebUI for clients, HAProxy for routing, and Ollama-based workloads. The approach aims to democratize access to generative AI for academic labs and SMEs by providing a unified interface to diverse LLM instances. Future work includes quantitative benchmarking, large-scale validation, and enhanced fault tolerance to support broader adoption.

Abstract

The rise of Large Language Models (LLMs) has increased the need for scalable, high-performance inference systems, yet most existing frameworks assume homogeneous, resource-rich hardware, an assumption that is often unrealistic in academic or other resource-constrained settings. We introduce AIvailable, a low-cost, highly available LLM-as-a-Service (LLMaaS) platform that uses a software-defined approach to run LLMs across heterogeneous and legacy GPU nodes, including NVIDIA and AMD devices, with a focus on fully utilizing each node's VRAM. AIvailable performs fully GPU-accelerated inference without CPU fallbacks and features a unified client interface that allows seamless interaction with all deployed LLMs through a single logical unit. The architecture comprises four main components: the Client Interface for user access, the Service Frontend for secure request routing and load balancing, the SDAI Controller for orchestration, deployment, and monitoring, and the Service Backend of heterogeneous GPU nodes executing workloads. By abstracting GPU-specific details and providing dynamic, VRAM-aware allocation and reallocation of models, AIvailable ensures efficient use of resources and resilience against failures or workload fluctuations. Targeting academic labs, private companies, and other constrained organizations, it supports diverse open LLMs, helping democratize generative AI through the repurposing of legacy GPUs.
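To make the idea of VRAM-aware allocation and reallocation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a best-fit heuristic (place each model on the node with the least free VRAM that still fits it) and refuses placement when no GPU has room, mirroring the no-CPU-fallback policy. The names `GpuNode`, `place`, and `reallocate`, along with all node names and VRAM figures, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GpuNode:
    """Hypothetical description of one backend GPU node."""
    name: str
    vendor: str      # "nvidia" or "amd"; informational only in this sketch
    vram_mib: int    # total VRAM on the node
    models: dict = field(default_factory=dict)  # model name -> VRAM footprint (MiB)

    @property
    def free_mib(self) -> int:
        return self.vram_mib - sum(self.models.values())


def place(nodes, model, need_mib):
    """Best-fit placement (an assumed heuristic): pick the node with the
    least free VRAM that can still hold the model."""
    candidates = [n for n in nodes if n.free_mib >= need_mib]
    if not candidates:
        return None  # no CPU fallback: the model stays unplaced
    target = min(candidates, key=lambda n: n.free_mib)
    target.models[model] = need_mib
    return target.name


def reallocate(nodes, failed_name):
    """Drop a failed node and try to re-place its models, largest first,
    on the surviving nodes. Returns (survivors, model -> new node or None)."""
    failed = next(n for n in nodes if n.name == failed_name)
    survivors = [n for n in nodes if n.name != failed_name]
    moved = {}
    for model, need in sorted(failed.models.items(), key=lambda kv: -kv[1]):
        moved[model] = place(survivors, model, need)
    return survivors, moved


if __name__ == "__main__":
    # Illustrative mixed-vendor pool; sizes are made up.
    pool = [
        GpuNode("node-a", "nvidia", 8192),
        GpuNode("node-b", "amd", 16384),
        GpuNode("node-c", "nvidia", 6144),
    ]
    print(place(pool, "model-7b", 5000))
    print(place(pool, "model-13b", 9000))
    print(reallocate(pool, "node-b"))
```

Best-fit is used here because it tends to keep large contiguous VRAM free for bigger models, which matches the stated goal of fully utilizing each node's VRAM. A real controller would also need to account for context-length-dependent memory, runtime overhead, and vendor-specific driver constraints.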

Paper Structure

This paper contains 10 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: AIvailable Architecture
  • Figure 2: AIvailable Prototype
  • Figure 3: SDAI Interface Dashboard
  • Figure 4: SDAI Configuration Wizard
  • Figure 5: SDAI GPU Selection
  • ...and 3 more figures