SRE-Llama -- Fine-Tuned Meta's Llama LLM, Federated Learning, Blockchain and NFT Enabled Site Reliability Engineering(SRE) Platform for Communication and Networking Software Services

Eranga Bandara, Safdar H. Bouk, Sachin Shetty, Ravi Mukkamala, Abdul Rahman, Peter Foytik, Ross Gore, Xueping Liang, Ng Wee Keong, Kasun De Zoysa

TL;DR

SRE-Llama addresses the challenge of defining and maintaining SLIs/SLOs in cloud-native communication software by marrying Federated Learning, blockchain governance, NFT-based provenance, and Generative AI via a fine-tuned Llama-3 model. The six-layer architecture enables secure data storage, privacy-preserving model training, and automated SLO generation with Prometheus-compatible alerting, all anchored by NFT-encoded SLIs/SLOs on a blockchain. Key innovations include a coordinator-less FL system, a novel s-528 NFT schema for SLI/SLO tokens, and Llama-3-driven SLO/alert synthesis guided by PromQL. The proposed prototype, demonstrated on a customized Open5GS 5G Core, shows promise for scalable, auditable, and proactive SRE in modern networks, with future work focusing on expanding LLM participation and open-source integrations.

Abstract

Software services are crucial for reliable communication and networking; therefore, Site Reliability Engineering (SRE) is important to ensure these systems stay reliable and perform well in cloud-native environments. SRE leverages tools like Prometheus and Grafana to monitor system metrics and to define the Service Level Indicators (SLIs) and Service Level Objectives (SLOs) that maintain high service standards. However, many developers lack an in-depth understanding of these tools and of the intricacies involved in defining appropriate SLIs and SLOs. To bridge this gap, we propose a novel SRE platform, called SRE-Llama, enhanced by Generative AI, Federated Learning, Blockchain, and Non-Fungible Tokens (NFTs). This platform aims to automate and simplify monitoring, SLI/SLO generation, and alert management, improving accessibility and efficiency for developers. The system captures metrics from cloud-native services and stores them in a time-series database such as Prometheus or Mimir. Using this stored data, the platform employs Federated Learning models to identify the most relevant and impactful SLI metrics for different services and SLOs while addressing data-privacy concerns. Subsequently, a fine-tuned Meta Llama-3 LLM generates SLIs, SLOs, error budgets, and associated alerting mechanisms from these identified SLI metrics. A unique aspect of our platform is the encoding of generated SLIs and SLOs as NFT objects, which are then stored on a blockchain. This feature provides immutable record-keeping and facilitates easy verification and auditing of the SRE metrics and objectives. The automation of the proposed platform is governed by blockchain smart contracts. The proposed SRE-Llama platform prototype has been implemented with a use case featuring a customized Open5GS 5G Core.
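To make the SLO and error-budget terminology in the abstract concrete, the following is a minimal sketch of the standard error-budget arithmetic. It is not the platform's implementation: the `Slo` class, its field names, and the helper functions are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO; names are illustrative, not from the paper."""
    name: str
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    window_days: int = 30  # rolling compliance window

    def __post_init__(self) -> None:
        # A target of exactly 1.0 would leave no error budget at all.
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_fraction(slo: Slo) -> float:
    """Share of requests (or time) that may fail without breaching the SLO."""
    return 1.0 - slo.target


def allowed_downtime_minutes(slo: Slo) -> float:
    """Error budget expressed as minutes of full outage over the window."""
    return error_budget_fraction(slo) * slo.window_days * 24 * 60


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests == 0:
        return 1.0  # no traffic observed, so nothing has been spent
    allowed_failures = error_budget_fraction(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

Under these assumptions, a 99.9% target over a 30-day window leaves roughly 43.2 minutes of allowed downtime, and an alert can be driven by how quickly `budget_remaining` approaches zero.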

Paper Structure

This paper contains 18 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Platform layered architecture.
  • Figure 2: Fine-tune LLM with QLoRA and deploy with Ollama.
  • Figure 3: Proposed large-scale testbed architecture with Ericsson's new RAN, Open5GS 5G-core, and on-prem Llama-3 LLM in VMASC Virginia US.
  • Figure 4: SLO generation prompt.
  • Figure 5: SLO for the secret creation request.
  • ...and 4 more figures