LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

Pranay Tummalapalli; Sahil Arayakandy; Ritam Pal; Kautuk Kundan

Abstract

Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.
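The energy comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (131.7 tok/s at 34.1 W for the RTX 4050, 6.9 tok/s for the Hailo-10H). Because the abstract reports the Hailo-10H power only as "under 2 W", the code takes 2.0 W as an assumed upper bound, so the resulting energy per token is a ceiling rather than a measured value. The function name `joules_per_token` is illustrative and does not come from the paper.

```python
def joules_per_token(power_w: float, tok_per_s: float) -> float:
    """Energy per generated token: watts / (tokens per second) = joules per token."""
    return power_w / tok_per_s


# Figures quoted in the abstract.
RTX_POWER_W = 34.1
RTX_TOK_PER_S = 131.7
HAILO_TOK_PER_S = 6.9

# Assumption: the abstract says "under 2 W", so 2.0 W is used as an upper bound.
HAILO_POWER_UPPER_W = 2.0

rtx_j_per_tok = joules_per_token(RTX_POWER_W, RTX_TOK_PER_S)
hailo_j_per_tok_upper = joules_per_token(HAILO_POWER_UPPER_W, HAILO_TOK_PER_S)
throughput_ratio = RTX_TOK_PER_S / HAILO_TOK_PER_S

print(f"RTX 4050:  {rtx_j_per_tok:.3f} J/token")
print(f"Hailo-10H: at most {hailo_j_per_tok_upper:.3f} J/token")
print(f"Throughput ratio: {throughput_ratio:.1f}x")
```

Under these assumptions both platforms land at roughly a quarter to a third of a joule per token, which is consistent with the abstract's claim of comparable energy proportionality at about 19x lower throughput.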

Paper Structure (23 sections, 5 figures, 10 tables)

Introduction
Background
Quantised Inference on Edge Hardware
Autoregressive Decoding Phases
Thermal Throttling in Mobile SoCs
Methodology
Platforms and Inference Stacks
Model
Prompt and Generation Configuration
Metrics
Experimental Protocol
Results
NVIDIA RTX 4050 (Laptop GPU)
Raspberry Pi 5 + Hailo-10H NPU
iPhone 16 Pro (iOS / MLX)
...and 8 more sections

Figures (5)

Figure 1: Experimental protocol flowchart. Each platform undergoes this procedure independently. Cold-start prefill time from the warm-up inference is recorded separately and excluded from throughput analysis.
Figure 2: RTX 4050 per-run throughput (left axis) and GPU/CPU temperature (right axis) across runs 2–20. Mean throughput of 131.70 tok/s (CV = 2.2%) confirms stable battery-throttled performance; GPU temperature rises from 55°C to 70°C with no thermal throttling observed.
Figure 3: RPi 5 + Hailo-10H throughput (line, left axis) and NPU/CPU temperature (lines, right axis) across runs 2–20. Throughput CV of 0.04% and stable temperatures confirm no throttling at any point.
Figure 4: iPhone 16 Pro per-iteration throughput across 20 iterations. Bar shading indicates thermal state: light (Normal, iterations 1–2), hatched (Warm, iterations 3–7), dark (Hot, iterations 8–20). Dashed line marks the sustained Hot-state mean of 22.56 tok/s.
Figure 5: Samsung S24 Ultra per-iteration throughput and temperature. Valid iterations 1–5 (solid line) are used for analysis. Iteration 6 (open marker, dashed) is excluded: the Android thermal governor floored GPU frequency to 231 MHz, with GPU temperature reaching 78.3°C.
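
The captions for Figures 2 and 3 summarise stability with a mean throughput and a coefficient of variation (CV) over runs 2–20. A minimal sketch of that summary follows, assuming the first run is treated as the warm-up and dropped, and assuming the sample standard deviation is used; the paper's exact estimator is not stated here. The function name `sustained_stats` and the throughput values are hypothetical placeholders, not measurements from the paper.

```python
from statistics import mean, stdev


def sustained_stats(throughputs, skip=1):
    """Return (mean, CV in percent) after dropping the first `skip` warm-up runs.

    CV is the sample standard deviation divided by the mean, times 100.
    """
    warm = throughputs[skip:]
    if len(warm) < 2:
        raise ValueError("need at least two post-warm-up runs to compute a CV")
    mu = mean(warm)
    cv_percent = stdev(warm) / mu * 100.0
    return mu, cv_percent


# Hypothetical per-run throughput in tok/s; run 1 is the warm-up inference.
example_runs = [120.0, 130.0, 132.0, 134.0, 128.0]

mu, cv = sustained_stats(example_runs)
print(f"mean = {mu:.2f} tok/s, CV = {cv:.2f}%")
```

A low CV, such as the 0.04% reported for the Hailo-10H, indicates that per-run throughput barely moves around its mean, which is the basis for the captions' no-throttling claims.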
