Characterizing and Understanding Energy Footprint and Efficiency of Small Language Model on Edges

Md Romyull Islam, Bobin Deng, Nobel Dhar, Tu N. Nguyen, Selena He, Yong Shi, Kun Suo

TL;DR

This work investigates the energy footprint of small language models (SLMs) deployed on edge devices by quantizing four SLMs to 4-bit GGUF and evaluating them across Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano (CPU and GPU). Using real-time power measurements on MMLU, HellaSwag, and Winogrande, the study quantifies accuracy, latency, throughput, and energy-related metrics to reveal power-performance trade-offs. Key findings show that Jetson Orin Nano with GPU delivers the best energy-to-performance ratio, Llama 3.2 offers the best balance of accuracy and power, TinyLlama excels in ultra-low-power contexts, and Phi-3 Mini is generally the least energy-efficient despite high accuracy. The results inform practical deployment guidelines for edge AI, emphasizing hardware acceleration, memory bandwidth, and architecture choices, and point to future directions like adaptive power management and alternative quantization strategies.

Abstract

Cloud-based large language models (LLMs) and their variants have significantly influenced real-world applications. Deploying smaller models, i.e., small language models (SLMs), on edge devices offers additional advantages, such as reduced latency and independence from network connectivity. However, edge devices' limited computing resources and constrained energy budgets challenge efficient deployment. This study evaluates the power efficiency of four representative SLMs (Llama 3.2, Phi-3 Mini, TinyLlama, and Gemma 2) on Raspberry Pi 5, Jetson Nano, and Jetson Orin Nano (CPU and GPU configurations). Results show that Jetson Orin Nano with GPU acceleration achieves the highest energy-to-performance ratio, significantly outperforming CPU-based setups. Llama 3.2 provides the best balance of accuracy and power efficiency, while TinyLlama is well-suited for low-power environments at the cost of reduced accuracy. In contrast, Phi-3 Mini consumes the most energy despite its high accuracy. In addition, GPU acceleration, memory bandwidth, and model architecture are key to optimizing inference energy efficiency. Our empirical analysis offers practical insights for AI, smart systems, and mobile ad-hoc platforms to balance trade-offs among accuracy, inference latency, and power efficiency in energy-constrained environments.

Paper Structure

This paper contains 37 sections, 10 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Full Precision (FP32) GPU Performance Trend over the Years Based on Hardware Price
  • Figure 2: Full Precision (FP32) GPU Performance Trend over the Years Based on Energy Usage
  • Figure 3: Setup of Power Measurement System
  • Figure 4: Prediction Accuracy Per Watt-Hour and Energy Consumption Per Inference Across Devices
  • Figure 5: Comparison of Total Time, Energy Consumption, and Tokens Per Second for Benchmark Tasks Across Edge Devices
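The metrics named in Figures 4 and 5 (accuracy per watt-hour, energy per inference, tokens per second) can be derived from a log of timestamped power readings. The following is a minimal sketch under stated assumptions, not the authors' measurement pipeline: the function names, the `RunSummary` fields, the trapezoidal integration, and the definition of accuracy per watt-hour as accuracy divided by total energy in Wh are all illustrative choices that this summary does not confirm.

```python
from dataclasses import dataclass
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(power_w)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


@dataclass
class RunSummary:
    # Hypothetical container; field names are assumptions for illustration.
    energy_per_inference_j: float
    tokens_per_second: float
    accuracy_per_wh: float


def summarize_run(
    timestamps_s: Sequence[float],
    power_w: Sequence[float],
    n_inferences: int,
    n_correct: int,
    n_tokens: int,
) -> RunSummary:
    """Derive energy and throughput metrics for one benchmark run."""
    energy_j = energy_joules(timestamps_s, power_w)
    duration_s = timestamps_s[-1] - timestamps_s[0]
    energy_wh = energy_j / 3600.0  # 1 Wh = 3600 J
    accuracy = n_correct / n_inferences
    return RunSummary(
        energy_per_inference_j=energy_j / n_inferences,
        tokens_per_second=n_tokens / duration_s,
        accuracy_per_wh=accuracy / energy_wh,
    )


# Made-up example: a constant 5 W draw sampled once per second for 4 s.
summary = summarize_run(
    timestamps_s=[0.0, 1.0, 2.0, 3.0, 4.0],
    power_w=[5.0, 5.0, 5.0, 5.0, 5.0],
    n_inferences=10,
    n_correct=5,
    n_tokens=80,
)
print(summary)
```

The sample values are invented for illustration only. Integrating the power trace, rather than multiplying mean power by duration, keeps the estimate reasonable when the sampling interval is uneven.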