Quantifying Edge Intelligence: Inference-Time Scaling Formalisms for Heterogeneous Computing
Satyam Kumar, Saurabh Jha
TL;DR
The paper addresses the challenge of deploying large-language-model inference on resource-constrained edge devices by introducing QEIL, a framework that unifies inference-time scaling formalisms with heterogeneous hardware orchestration. It identifies five stable scaling relationships for coverage, energy, latency, and cost, and couples them with the Energy-Coverage Efficiency and Intelligence Per Watt (IPW) metrics to enable multi-objective optimization. A safety-first agentic orchestrator assigns inference tasks across CPUs, GPUs, and NPUs while enforcing thermal constraints, fault tolerance, and adversarial robustness. Across five transformer families and three benchmarks, QEIL demonstrates substantial gains in coverage (7–10.5 percentage points), energy reduction (47–78%), and IPW (2.1–5.6×) with robust safety guarantees, supporting practical edge deployments. The work bridges datacenter heterogeneous orchestration with edge safety-critical requirements, offering a scalable path toward reliable, energy-efficient edge AI.
Abstract
Deploying large language models (LLMs) on resource-constrained edge devices is limited by a poor understanding of inference-time scaling on heterogeneous hardware. We present QEIL (Quantifying Edge Intelligence via Inference-Time Scaling Formalisms), a unified framework to characterize and optimize inference across CPUs, GPUs, and NPUs. QEIL reveals stable power-law scaling behavior in latency, energy, and task coverage for transformer models ranging from 125M to 2.6B parameters, and demonstrates that heterogeneous orchestration with intelligent coordination across mixed accelerators consistently improves energy efficiency and coverage compared to homogeneous execution. QEIL introduces three composite metrics: Intelligence per Watt, Energy-Coverage Efficiency, and Price Power Performance, enabling multi-objective optimization for edge intelligence. A safety-first agentic orchestrator dynamically allocates workloads across same-vendor and cross-vendor accelerators while enforcing thermal constraints, fault-tolerant execution, adversarial input validation, and continuous hardware health monitoring. Evaluations across five model families show that QEIL achieves consistent improvements in efficiency, latency, and coverage without sacrificing accuracy or system safety, establishing inference-time scaling and heterogeneous orchestration as key foundations for reliable edge AI.
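The abstract does not give the functional forms of the scaling relationships. As a minimal sketch of what "power-law scaling" means in practice, the snippet below assumes the generic form y = a * x^b and recovers a and b by least squares in log-log space. The function name `fit_power_law`, the intermediate model sizes, and the latency values are hypothetical placeholders for illustration; they are not QEIL's fitting procedure or measurements from the paper.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by ordinary least squares on (log x, log y).

    Returns (a, b). All inputs must be strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


# Synthetic placeholder data (NOT results from the paper): model sizes in
# billions of parameters, with latencies generated from an assumed law.
sizes_b = [0.125, 0.35, 1.3, 2.6]
latency = [2.0 * s ** 0.5 for s in sizes_b]

a, b = fit_power_law(sizes_b, latency)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On real measurements the points will not lie exactly on a line in log-log space, so the residuals of this fit are one simple way to judge how "stable" a claimed power law is.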
