An Evaluation of LLMs Inference on Popular Single-board Computers
Tung Nguyen, Tuyen Nguyen
TL;DR
This paper investigates the feasibility of on-device LLM inference on affordable single-board computers (SBCs) by benchmarking 25 quantized open-source models on three boards (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro) with two runtimes (Ollama and Llamafile). It measures token throughput, memory footprint, and power consumption under varied CPU configurations, using three prompting workloads. The results show that SBCs can reliably support models of up to 1.5B parameters, and that Llamafile delivers up to 4× higher throughput and 30–40% lower power usage than Ollama. The study also identifies architecture-specific bottlenecks, gives hardware-specific recommendations for edge AI deployment that weigh model size, latency, and energy efficiency against one another, and outlines future work on dynamic quantization and hardware-aware scheduling.
Abstract
The growing demand for on-device large language model (LLM) inference is driving interest in deploying lightweight, cost-effective AI solutions on edge hardware. Single-board computers (SBCs) such as the Raspberry Pi and Orange Pi offer a promising platform for localized, privacy-preserving inference, but they remain underexplored in the context of LLM workloads. In this work, we benchmark the performance of 25 quantized open-source LLMs across three SBCs (Raspberry Pi 4, Raspberry Pi 5, and Orange Pi 5 Pro) using two inference runtimes: Ollama and Llamafile. We evaluate generation throughput, memory usage, and power consumption under varying CPU configurations, using multiple prompt types to simulate realistic workloads. Our results show that SBCs can reliably support models up to 1.5B parameters, with Llamafile achieving up to 4x higher throughput and 30–40% lower power usage than Ollama. We identify architecture-specific bottlenecks, highlight runtime-level trade-offs, and provide practical deployment recommendations. This study offers the first broad evaluation of LLM inference on SBCs, bridging the gap between high-performance language models and affordable edge computing.
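As an illustration of the metrics named above, the following minimal sketch shows one way generation throughput and energy per token could be derived from raw measurements. It is not taken from the paper's benchmarking code. It assumes a response shaped like the one returned by Ollama's generate endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function names, the sample values, and the 6 W average power reading are hypothetical.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation throughput from a token count and a duration in nanoseconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def joules_per_token(avg_power_watts: float, tok_per_s: float) -> float:
    """Energy cost per generated token: watts (J/s) divided by tokens/s."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return avg_power_watts / tok_per_s


# Hypothetical measurement: 120 tokens generated in 24 s at an average of 6 W.
sample = {"eval_count": 120, "eval_duration": 24_000_000_000}
tps = tokens_per_second(sample["eval_count"], sample["eval_duration"])
energy = joules_per_token(6.0, tps)
print(f"{tps:.2f} tok/s, {energy:.2f} J/token")
```

Reporting energy per token alongside throughput makes runtimes comparable even when they draw different amounts of power, since a faster runtime can use less energy per token despite a similar or higher instantaneous draw.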
