Inference Energy and Latency in AI-Mediated Education: A Learning-per-Watt Analysis of Edge and Cloud Models

Kushal Khemani

Abstract

Immediate feedback is a foundational requirement of effective AI-mediated learning, yet the energy and latency costs of delivering it remain largely unexamined. This study investigates the latency-energy-learning trade-off in AI tutoring through an empirical comparison of two on-device inference configurations of Microsoft Phi-3 Mini (4k-instruct) on an NVIDIA T4 GPU: full-precision FP16 and 4-bit NormalFloat (NF4) quantisation. Both were evaluated under KV-cache-enabled inference across 500 educational prompts spanning five secondary school subject domains. Pedagogical quality was assessed for each of the 1,000 generated responses by a hybrid panel of 10 Cambridge International teachers and three frontier AI systems using a four-dimension rubric. We introduce Learning-per-Watt (LpW), a novel metric quantifying pedagogical value per unit of energy over the learner's waiting window. Under realistic deployment, NF4 achieves lower per-inference energy than FP16 (329 J vs. 369 J) but higher latency (13.4 s vs. 9.2 s), yielding a modest FP16 advantage in LpW of 1.33x at a quality difference of 0.19 points. Under cache-disabled inference (used in offline evaluation but absent from real deployments), the gap widens to 7.4x, overstating the FP16 advantage by more than fivefold. Quantisation efficiency depends on both the hardware and the inference regime, with significant implications for equitable AI tutoring deployment in low-resource settings.
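The arithmetic behind the reported LpW advantage can be sketched in a few lines. The sketch below assumes LpW takes the form quality / (energy x latency), which is inferred here from the (J s)$^{-1}$ unit quoted in the Figure 1 caption and is not a definition stated in the abstract. The mean quality scores are not given on this page, so the function defaults to equal quality (a hypothetical placeholder) and isolates the energy-latency component of the ratio. The names `lpw` and `fp16_over_nf4` are illustrative.

```python
# Mean per-inference figures quoted in the abstract (cache-enabled, NVIDIA T4).
FP16 = {"energy_j": 369.0, "latency_s": 9.2}
NF4 = {"energy_j": 329.0, "latency_s": 13.4}


def lpw(quality: float, energy_j: float, latency_s: float) -> float:
    """Assumed form of Learning-per-Watt: quality / (energy * latency)."""
    return quality / (energy_j * latency_s)


def fp16_over_nf4(quality_fp16: float = 1.0, quality_nf4: float = 1.0) -> float:
    """FP16/NF4 LpW ratio. Equal quality by default (hypothetical placeholder)."""
    return lpw(quality_fp16, **FP16) / lpw(quality_nf4, **NF4)


# NF4 draws less energy per inference (ratio below 1) ...
energy_ratio = NF4["energy_j"] / FP16["energy_j"]
# ... but keeps the learner waiting longer (ratio above 1).
latency_ratio = NF4["latency_s"] / FP16["latency_s"]
# With quality held equal, the LpW ratio is the product of the two.
cost_only_ratio = fp16_over_nf4()

print(f"energy ratio (NF4/FP16):  {energy_ratio:.3f}")
print(f"latency ratio (NF4/FP16): {latency_ratio:.3f}")
print(f"FP16/NF4 LpW ratio at equal quality: {cost_only_ratio:.3f}")
```

Under this assumed form, the equal-quality ratio comes to roughly 1.30, which suggests that most of the reported 1.33x advantage stems from the energy-latency product, with the remainder plausibly attributable to the 0.19-point quality difference.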

Paper Structure

This paper contains 58 sections, 6 equations, 3 figures, 16 tables.

Figures (3)

  • Figure 1: Learning-per-Watt (LpW) distributions for on-device FP16 and NF4 configurations (Phi-3 Mini, NVIDIA T4 GPU, use_cache=True, $n = 500$ prompts per configuration). (a) Frequency histogram of per-prompt LpW values. The NF4 distribution (red) is centred near $1.88 \times 10^{-3}$ (J s)$^{-1}$; the FP16 distribution (blue) near $2.50 \times 10^{-3}$ (J s)$^{-1}$. The distributions partially overlap, in contrast to the complete separation observed under cache-disabled conditions (Appendix C). (b) Per-category box plots showing IQR and median LpW for each subject domain. The FP16--NF4 gap is consistent across all five categories, confirming that deployment precision rather than prompt domain is the primary determinant of LpW.
  • Figure 2: Sensitivity and scenario analysis. (a) FP16/NF4 ratio under four alternative composite metrics (use_cache=True). QpJ favours NF4 ($0.91\times$), reflecting NF4's lower per-inference energy; all metrics that include latency favour FP16, ranging from $1.17\times$ (LpW$_{\mathrm{geo}}$) to $1.49\times$ (QpS). (b) Cloud LpW under five server-energy scenarios (log scale), with on-device FP16 (blue dashed) and NF4 (red dotted) reference lines from the cache-enabled experiment. The highlighted bar (orange) is the central GPT-4o short-prompt estimate from Jegham2025. Only the short-prompt central estimate places cloud LpW above on-device FP16; all higher-energy scenarios fall below both edge configurations.
  • Figure 3: Cache regime comparison across three dimensions for FP16 and NF4 configurations of Phi-3 Mini on the NVIDIA T4 ($n = 500$ prompts per configuration). Dark bars show primary study results (use_cache=True); light bars show the secondary cache-disabled experiment (use_cache=False). (a) Mean latency: enabling the cache reduces FP16 latency by $1.80\times$ and NF4 latency by $3.70\times$. (b) Mean net energy: the cache reduces FP16 energy by $1.76\times$ and NF4 energy by $5.72\times$. (c) Mean LpW (log scale): the combined effect produces a $3.35\times$ LpW improvement for FP16 and a $22.1\times$ improvement for NF4, compressing the FP16--NF4 efficiency gap from $8.80\times$ (cache=OFF) to $1.33\times$ (cache=ON).
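The gap compression described in the Figure 3 caption can be checked directly from the quoted factors. The sketch below is a consistency check, not the authors' analysis code: it uses only the numbers in the caption, and the function name `gap_with_cache` is illustrative. The second check additionally assumes that LpW scales as 1 / (energy x latency), an inference from the (J s)$^{-1}$ unit in Figure 1.

```python
# Factors quoted in the Figure 3 caption (Phi-3 Mini, NVIDIA T4).
LPW_GAIN_FP16 = 3.35   # mean LpW improvement from enabling the cache, FP16
LPW_GAIN_NF4 = 22.1    # mean LpW improvement from enabling the cache, NF4
GAP_CACHE_OFF = 8.80   # FP16/NF4 mean-LpW ratio with use_cache=False

LATENCY_GAIN = {"FP16": 1.80, "NF4": 3.70}  # panel (a)
ENERGY_GAIN = {"FP16": 1.76, "NF4": 5.72}   # panel (b)


def gap_with_cache(gap_off: float, gain_fp16: float, gain_nf4: float) -> float:
    """FP16/NF4 gap after both configurations receive their cache speed-up."""
    return gap_off * gain_fp16 / gain_nf4


gap_on = gap_with_cache(GAP_CACHE_OFF, LPW_GAIN_FP16, LPW_GAIN_NF4)
print(f"FP16/NF4 gap with cache enabled: {gap_on:.2f}x")

# Assumption: if LpW scales as 1 / (energy * latency), the LpW gain should be
# close to the product of the latency and energy gains. Exact agreement is not
# expected, because a mean of per-prompt ratios differs from a ratio of means.
for name in ("FP16", "NF4"):
    approx_gain = LATENCY_GAIN[name] * ENERGY_GAIN[name]
    print(f"{name}: latency gain x energy gain = {approx_gain:.2f}x")
```

The first check reproduces the 1.33x cache-enabled gap stated in the caption. The second yields products somewhat below the reported 3.35x and 22.1x, which is consistent with per-prompt averaging (and any small quality shift) rather than a contradiction.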