Generative AI on the Edge: Architecture and Performance Evaluation

Zeinab Nezami, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi

TL;DR

An investigation of computationally demanding LLM inference on a single commodity Raspberry Pi, serving as an edge testbed for ORAN, concludes that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.

Abstract

6G's AI-native vision of embedding advanced intelligence in the network while bringing it closer to the user requires a systematic evaluation of Generative AI (GenAI) models on edge devices. Rapidly emerging solutions based on Open RAN (ORAN) and Network-in-a-Box strongly advocate the use of low-cost, off-the-shelf components for simpler and more efficient deployment, e.g., in provisioning rural connectivity. In this context, conceptual architectures, hardware testbeds and precise performance quantification of Large Language Models (LLMs) on off-the-shelf edge devices remain largely unexplored. This research investigates computationally demanding LLM inference on a single commodity Raspberry Pi serving as an edge testbed for ORAN. We investigate various LLMs, including small, medium and large models, on a Raspberry Pi 5 cluster using a lightweight Kubernetes distribution (K3s) with a modular prompting implementation. We study feasibility and limitations by analyzing throughput, latency, accuracy and efficiency. Our findings indicate that CPU-only deployment of lightweight models, such as Yi, Phi, and Llama3, can effectively support edge applications, achieving a generation throughput of 5 to 12 tokens per second with less than 50% CPU and RAM usage. We conclude that GenAI on the edge offers localized inference in remote or bandwidth-constrained environments in 6G networks without reliance on cloud infrastructure.
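
The throughput figures quoted in the abstract are token rates. As a minimal sketch of how such rates could be derived from raw timings, the snippet below computes prefill and decode throughput and the share of time spent in each stage. The `InferenceTiming` class, the `summarize` function, and the example numbers are hypothetical illustrations, not the authors' measurement code or results.

```python
from dataclasses import dataclass


@dataclass
class InferenceTiming:
    """Hypothetical record of one inference request (not from the paper)."""
    prompt_tokens: int       # tokens processed during prefill
    prefill_seconds: float   # wall-clock time of the prefill stage
    generated_tokens: int    # tokens produced during decode
    decode_seconds: float    # wall-clock time of the decode stage


def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput as tokens divided by elapsed seconds."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return tokens / seconds


def summarize(t: InferenceTiming) -> dict:
    """Per-stage throughput and each stage's share of the total time."""
    total = t.prefill_seconds + t.decode_seconds
    return {
        "prefill_tps": tokens_per_second(t.prompt_tokens, t.prefill_seconds),
        "decode_tps": tokens_per_second(t.generated_tokens, t.decode_seconds),
        "prefill_share": t.prefill_seconds / total,
        "decode_share": t.decode_seconds / total,
    }


# Illustrative values only: 120 prompt tokens in 4 s, 200 generated tokens in 25 s.
example = InferenceTiming(prompt_tokens=120, prefill_seconds=4.0,
                          generated_tokens=200, decode_seconds=25.0)
summary = summarize(example)
print(summary)
```

With these made-up inputs the decode rate works out to 200 / 25 = 8 tokens per second, which happens to sit inside the 5 to 12 tokens per second range reported above; real values depend on the model, quantization, and prompt length.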

Paper Structure

This paper contains 9 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: The hardware infrastructure utilized in the testbed, including all key components and their interconnections.
  • Figure 2: Architecture of the K3s Cluster and Deployment of LLMs
  • Figure 3: Prompt Lengths Across Conversations, highlighting the distribution of prompt lengths in various conversations.
  • Figure 4: Throughput performance for prefill and decode stages across models
  • Figure 5: The percentage of time each model allocates to key phases—Prefill, Decode, and Total Time
  • ...and 2 more figures