Orion: Characterizing and Programming Apple's Neural Engine for LLM Training and Inference

Ramchand Kumaresan

TL;DR

Orion is, to the authors' knowledge, the first open end-to-end system that combines direct ANE execution, a compiler pipeline, and stable multi-step training with checkpoint resume in a single native runtime. It bypasses CoreML entirely via Apple's private _ANEClient and _ANECompiler APIs.

Abstract

Over two billion Apple devices ship with a Neural Processing Unit (NPU), the Apple Neural Engine (ANE), yet this accelerator remains largely unused for large language model workloads. CoreML, Apple's public ML framework, imposes opaque abstractions that prevent direct ANE programming and do not support on-device training. We present Orion, to our knowledge the first open end-to-end system that combines direct ANE execution, a compiler pipeline, and stable multi-step training with checkpoint resume in a single native runtime, bypassing CoreML entirely via Apple's private _ANEClient and _ANECompiler APIs. Building on prior characterization work by maderix, we extend public knowledge of ANE constraints to a catalog of 20 restrictions on MIL IR programs, memory layout, compilation limits, and numerical behavior, including 14 previously undocumented constraints discovered during Orion development. Orion includes a compiler that lowers a graph IR through five optimization passes to ANE-native MIL and a runtime that manages IOSurface-backed zero-copy tensor I/O, program caching, and delta compilation for weight updates. Because the ANE bakes weights at compile time, naive training requires full recompilation at every step (~4.2 s). We show that compiled programs can instead be updated by unloading, patching weight files, and reloading, bypassing ANECCompile() and reducing recompilation from 4,200 ms to 494 ms per step (8.5x), yielding a 3.8x training speedup. On an M4 Max, Orion achieves 170+ tokens/s for GPT-2 124M inference and demonstrates stable training of a 110M-parameter transformer on TinyStories for 1,000 steps in 22 minutes with zero NaN occurrences. We also present LoRA adapter-as-input, enabling hot-swap of adapters via IOSurface inputs without recompilation.
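The speedup figures in the abstract can be cross-checked against each other with a few lines of arithmetic. The sketch below is illustrative rather than taken from Orion: it uses the 4,200 ms and 494 ms recompilation times from the abstract together with the 83.9% compile share and 60-kernel count quoted in the Figure 3 caption, and it assumes that all non-compile work per step is unchanged between v1.0 and v2.0. The variable names are hypothetical.

```python
# Numbers quoted in the abstract and in the Figure 3 caption.
full_recompile_ms = 4200.0   # v1.0: full ANE recompilation per training step
delta_reload_ms = 494.0      # v2.0: unload, patch weight files, reload
compile_share_v1 = 0.839     # fraction of a v1.0 step spent recompiling
num_kernels = 60             # kernels recompiled per step

# Assumption (not stated in the source): everything other than weight
# updates costs the same per step in v1.0 and v2.0.
step_v1_ms = full_recompile_ms / compile_share_v1
other_work_ms = step_v1_ms - full_recompile_ms
step_v2_ms = other_work_ms + delta_reload_ms

recompile_speedup = full_recompile_ms / delta_reload_ms
step_speedup = step_v1_ms / step_v2_ms

print(f"v1.0 step: {step_v1_ms:.0f} ms, v2.0 step: {step_v2_ms:.0f} ms")
print(f"weight-update speedup: {recompile_speedup:.2f}x")
print(f"end-to-end step speedup: {step_speedup:.2f}x")
print(f"per-kernel cost: {full_recompile_ms / num_kernels:.1f} ms -> "
      f"{delta_reload_ms / num_kernels:.1f} ms")
```

Under that assumption the weight-update speedup comes out near 8.5x and the end-to-end step speedup near 3.85x, in line with the reported 8.5x and 3.8x. The implied v2.0 step time of roughly 1.3 s is also close to the 1.32 s per step implied by 1,000 steps in 22 minutes, although the source does not state that both measurements use the same configuration.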

Paper Structure

This paper contains 49 sections, 2 equations, 10 figures, 12 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Orion architecture stack. Each layer communicates only with its immediate neighbors. The compiler and runtime together abstract away ANE constraints from the model layer.
  • Figure 2: Weight update paths. v1.0 (left) creates new model descriptors and invokes the ANE compiler for every weight update ($\sim$70 ms/kernel). v2.0 (right) reuses existing model objects: unload, write new weight files, reload ($\sim$9 ms/kernel). The compiler is bypassed entirely.
  • Figure 3: Training step time breakdown. v1.0 spends 83.9% of each step on full ANE recompilation ($\sim$4,200 ms for 60 kernels). v2.0's delta reload reduces this to 494 ms by bypassing ANECCompile() entirely, yielding a 3.8$\times$ total speedup.
  • Figure 4: LoRA-fused linear layer. Base weights $W_{\text{base}}$ are baked into the compiled ANE program (blue). Adapter matrices $A$, $B$ are passed as IOSurface inputs (green) and can be swapped without recompilation. $Y = XW_{\text{base}} + \alpha(XA)B$.
  • Figure 5: Training loss before and after the three-bug NaN fix. ANEgpt diverges to NaN at step 2 with 100% reproducibility (red, dashed arrow indicates divergence to $\infty$). Orion achieves stable, monotonically decreasing loss across 5 steps with checkpoint resume (green).
  • ...and 5 more figures
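The formula in the Figure 4 caption, $Y = XW_{\text{base}} + \alpha(XA)B$, can be illustrated with a small pure-Python sketch. This is not Orion's implementation and does not touch the ANE or IOSurface: it only shows, under assumed shapes ($X$: $n \times d_{\text{in}}$, $W_{\text{base}}$: $d_{\text{in}} \times d_{\text{out}}$, $A$: $d_{\text{in}} \times r$, $B$: $r \times d_{\text{out}}$), that keeping the adapter as a separate input gives the same result as merging it into the weights, which is why an adapter can be swapped while the base weights stay fixed. All function and variable names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Elementwise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]

def lora_linear(x, w_base, a, b, alpha):
    """Adapter-as-input form: Y = X W_base + alpha (X A) B."""
    return add(matmul(x, w_base), scale(matmul(matmul(x, a), b), alpha))

def merged_linear(x, w_base, a, b, alpha):
    """Reference form with the adapter folded into the weights."""
    return matmul(x, add(w_base, scale(matmul(a, b), alpha)))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

rng = random.Random(0)
n, d_in, d_out, r, alpha = 3, 8, 6, 2, 0.5   # illustrative sizes only
x = rand_matrix(n, d_in, rng)
w_base = rand_matrix(d_in, d_out, rng)       # stands in for the baked weights
a1, b1 = rand_matrix(d_in, r, rng), rand_matrix(r, d_out, rng)
a2, b2 = rand_matrix(d_in, r, rng), rand_matrix(r, d_out, rng)

# Swapping adapters only changes the (A, B) arguments; w_base is untouched.
y1 = lora_linear(x, w_base, a1, b1, alpha)
y2 = lora_linear(x, w_base, a2, b2, alpha)
print(max_abs_diff(y1, merged_linear(x, w_base, a1, b1, alpha)))
print(max_abs_diff(y2, merged_linear(x, w_base, a2, b2, alpha)))
```

Computing $(XA)B$ rather than $X(AB)$ keeps the intermediate at width $r$, so the adapter path stays cheap when $r$ is much smaller than $d_{\text{in}}$ and $d_{\text{out}}$.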