Quantifying Edge Intelligence: Inference-Time Scaling Formalisms for Heterogeneous Computing

Satyam Kumar; Saurabh Jha

TL;DR

The paper addresses the challenge of deploying large-language-model inference on resource-constrained edge devices by introducing QEIL, a framework that unifies inference-time scaling formalisms with heterogeneous hardware orchestration. It identifies five stable scaling relationships for coverage, energy, latency, and cost, and couples them with Energy-Coverage Efficiency and Intelligence Per Watt metrics to enable multi-objective optimization. A safety-first agentic orchestrator assigns inference tasks across CPUs, GPUs, and NPUs while enforcing thermal constraints, fault tolerance, and adversarial robustness. Across five transformer families and three benchmarks, QEIL demonstrates substantial gains in coverage (7–10.5 percentage points), energy reduction (47–78%), and IPW (2.1–5.6×) with robust safety guarantees, supporting practical edge deployments. The work bridges datacenter heterogeneous orchestration with edge safety-critical requirements, offering a scalable path toward reliable, energy-efficient edge AI.

Abstract

Deploying large language models (LLMs) on resource-constrained edge devices is limited by a poor understanding of inference-time scaling on heterogeneous hardware. We present QEIL (Quantifying Edge Intelligence via Inference-time Scaling Formalisms), a unified framework to characterize and optimize inference across CPUs, GPUs, and NPUs. QEIL reveals stable power-law scaling behavior in latency, energy, and task coverage for transformer models ranging from 125M to 2.6B parameters, and demonstrates that heterogeneous orchestration with intelligent coordination across mixed accelerators consistently improves energy efficiency and coverage compared to homogeneous execution. QEIL introduces three composite metrics, Intelligence per Watt, Energy Coverage Efficiency, and Price Power Performance, which enable multi-objective optimization for edge intelligence. A safety-first agentic orchestrator dynamically allocates workloads across same-vendor and cross-vendor accelerators while enforcing thermal constraints, fault-tolerant execution, adversarial input validation, and continuous hardware health monitoring. Evaluations across five model families show that QEIL achieves consistent improvements in efficiency, latency, and coverage without sacrificing accuracy or system safety, establishing inference-time scaling and heterogeneous orchestration as key foundations for reliable edge AI.
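To make the power-law claim in the abstract concrete, the sketch below shows one common way to estimate a power-law exponent: an ordinary least-squares fit of y = a * x^b in log-log space. It is an illustration only and is not taken from the paper. The function name `fit_power_law`, the intermediate model sizes, and the latency values are hypothetical; the data are synthetic, generated from a known law so the fit can be checked.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    b = sxy / sxx
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic, illustrative data only (NOT measurements from the paper).
# Model sizes are expressed in units of 100M parameters; only the 125M and
# 2.6B endpoints come from the abstract, the middle two are assumptions.
sizes = [1.25, 3.5, 13.0, 26.0]
latency_ms = [2.0 * s ** 0.5 for s in sizes]  # generated from y = 2 * x**0.5

a, b = fit_power_law(sizes, latency_ms)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating constants. With real measurements, the spread of residuals in log-log space would indicate how stable the relationship is.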

Paper Structure (60 sections, 12 equations, 6 figures, 16 tables)

Introduction
Related Work
Inference-Time Scaling and Repeated Sampling
Intelligence Efficiency and Local-Cloud Hybrid Systems
Heterogeneous Computing and Cost-Aware Orchestration
Energy-Efficient Edge Deployment and Real-World Constraints
AI Safety, Reliability, and Fault-Tolerant Systems
Limitations of Training-Time Scaling and the Case for Inference-Time Optimization
Reinforcement Learning Scaling and Inference-Time Reasoning
Distributed Inference and Disaggregated Processing
Sparse Models and Mixture of Experts
Scaling Relationships and Training-Time Compute Efficiency
Compiler Infrastructure for Heterogeneous Targets
Transformer Architectures and Reasoning at Inference Time
Federated and Privacy-Preserving Learning at the Edge
...and 45 more sections

Figures (6)

Figure 1: QEIL (Quantifying Edge Intelligence via Inference-time Scaling Formalisms) Framework Architecture. Left panel shows model and device specifications as inputs. Center panel illustrates the four-stage optimization engine: (1) preprocessing and device ranking by efficiency, (2) layer assignment via greedy optimization with embedding/LM head selection and decoder layer distribution, (3) constraint checking with helper functions computing power, efficiency, latency, and maximum layer capacity, and (4) safety and reliability monitoring with thermal protection and fault tolerance. Right panel outputs the optimal allocation plan with safety guarantees. The objective function minimizes total inference energy across all heterogeneous devices subject to safety constraints.
Figure 2: Total Energy Consumption Comparison between Standard (homogeneous GPU) and Energy-Aware (heterogeneous QEIL) execution modes on GPT-2 (125M) with $S=20$ samples. Standard execution consumes 43,057.7 J while Energy-Aware execution achieves 22,487.8 J, representing a 47.8% reduction in total energy consumption through intelligent heterogeneous orchestration.
Figure 3: Latency Breakdown Comparison between CPU-Only and CPU-GPU-NPU (Heterogeneous) execution modes. CPU-Only execution requires 20.7 ms total (dominated by compute time at ~18 ms), while heterogeneous orchestration achieves 8.6 ms total through parallel execution across specialized hardware, representing a 58.5% latency reduction.
Figure 4: Real-Time Task Manager Visualization of QEIL Dynamic Orchestrator during heterogeneous inference execution on GPT-2 (125M). The snapshot demonstrates simultaneous utilization across multiple processing units: CPU at 9% (2.04 GHz) handling orchestration and lightweight operations, Intel AI Boost NPU at 44% executing memory-bound decode phases, Intel Graphics GPU at 95% processing compute-intensive prefill stages, and NVIDIA RTX PRO 5000 GPU at 21% (57°C) handling overflow compute tasks. This multi-vendor, multi-device parallel execution exemplifies QEIL's agentic orchestration capabilities in resource-constrained edge environments. Note the GPU temperature of 57°C, well below the 85°C thermal throttling threshold, demonstrating safe thermal operation.
Figure 5: Multi-sample aggregation efficiency and coverage improvements across models, demonstrating that heterogeneous orchestration enables superior pass@k coverage gains (7–10.5 percentage points) while maintaining computational stability. Energy-aware execution achieves 66.5%–70.0% coverage across all model families versus 56%–63% for standard homogeneous inference, illustrating that device-specific optimization enables more effective sample diversity. Smaller models (GPT-2, Qwen2) with lower baseline coverage achieve larger absolute improvements, consistent with logarithmic scaling dynamics where initial samples provide the highest marginal information content.
...and 1 more figure
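The headline percentages in the captions for Figures 2, 3, and 5 can be reproduced from the raw values they quote. The short check below is a sketch, not code from the paper; `percent_reduction` is a hypothetical helper name. For Figure 5 it assumes the low ends and the high ends of the two coverage ranges pair up, which the caption suggests (models with lower baseline coverage gain more) but does not state per model.

```python
def percent_reduction(baseline, improved):
    """Relative reduction from baseline to improved, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Figure 2: total energy in joules, Standard vs. Energy-Aware execution.
energy_reduction = percent_reduction(43057.7, 22487.8)

# Figure 3: total latency in milliseconds, CPU-Only vs. heterogeneous.
latency_reduction = percent_reduction(20.7, 8.6)

# Figure 5: coverage gain in percentage points. Pairing the range endpoints
# (56 -> 66.5 and 63 -> 70.0) is an assumption based on the caption.
gain_low_baseline = 66.5 - 56.0
gain_high_baseline = 70.0 - 63.0

print(f"energy reduction:  {energy_reduction:.1f}%")
print(f"latency reduction: {latency_reduction:.1f}%")
print(f"coverage gains:    {gain_high_baseline:.1f} to {gain_low_baseline:.1f} points")
```

Under that pairing, the computed values agree with the 47.8% energy reduction, the 58.5% latency reduction, and the 7–10.5 percentage-point coverage gains reported in the captions.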
