The AI_INFN Platform: Artificial Intelligence Development in the Cloud

Lucio Anderlini; Giulio Bianchini; Diego Ciangottini; Stefano Dal Pra; Diego Michelotto; Rosa Petrini; Daniele Spiga

TL;DR

The paper addresses the challenge of coordinating access to hardware accelerators for ML across development, testing, and production environments within INFN. It presents AI_INFN, a Kubernetes-based SaaS platform deployed in the INFN Cloud that enables scalable, GPU-accelerated AI workflows and cross-site collaboration through cross-provider offloading. Key innovations include the use of the NVIDIA GPU Operator for MIG-enabled GPUs, Virtual Kubelet with the InterLink API for federated resource usage, and a Snakemake-driven workflow layer with a local Kueue batch system. The architecture combines JupyterHub, templated environments (Conda, Apptainer, OCI images), robust storage and backup, and Prometheus/Grafana-based monitoring. Multi-site benchmarks and case studies are used to validate performance, scalability, and integration. The work offers a practical path toward unified, cloud-native AI infrastructure in INFN, enabling efficient sharing of accelerators and supporting diverse research domains.

Abstract

Machine Learning (ML) is profoundly reshaping the way researchers create, implement, and operate data-intensive software. Its adoption, however, introduces notable challenges for computing infrastructures, particularly when it comes to coordinating access to hardware accelerators across development, testing, and production environments. The INFN initiative AI_INFN (Artificial Intelligence at INFN) seeks to promote the use of ML methods across various INFN research scenarios by offering comprehensive technical support, including access to AI-focused computational resources. Leveraging the INFN Cloud ecosystem and cloud-native technologies, the project emphasizes efficient sharing of accelerator hardware while maintaining the breadth of the Institute's research activities. This contribution describes the deployment and commissioning of a Kubernetes-based platform designed to simplify GPU-powered data analysis workflows and enable their scalable execution on heterogeneous distributed resources. By integrating offloading mechanisms through Virtual Kubelet and the InterLink API, the platform allows workflows to span multiple resource providers, from Worldwide LHC Computing Grid sites to high-performance computing centers like CINECA Leonardo. We will present preliminary benchmarks, functional tests, and case studies, demonstrating both performance and integration outcomes.
