Context-Driven Performance Modeling for Causal Inference Operators on Neural Processing Units

Neelesh Gupta, Rakshith Jayanth, Dhruv Parikh, Viktor Prasanna

TL;DR

Edge devices face a fundamental mismatch between the memory-heavy, quadratic attention of standard transformers and the constrained memory and compute patterns of NPUs. The authors run an empirical study of quadratic and sub-quadratic causal operators on an NPU, supplemented by a roofline performance model. They find that quadratic attention is memory-bound because of cache inefficiency, while sub-quadratic variants are limited either by compute on the programmable vector cores or by DMA data movement. Structured operators such as Toeplitz and Linear attention map more effectively to the NPU dataflow, achieving higher utilization and latency that scales better for long-context inference. The work offers co-design guidance for hardware-aware model design and compiler optimizations aimed at private, on-device long-context inference on edge platforms.

Abstract

The proliferation of large language models has driven demand for long-context inference on resource-constrained edge platforms. However, deploying these models on Neural Processing Units (NPUs) presents significant challenges due to architectural mismatch: the quadratic complexity of standard attention conflicts with NPU memory and compute patterns. This paper presents a comprehensive performance analysis of causal inference operators on a modern NPU, benchmarking quadratic attention against sub-quadratic alternatives including structured state-space models and causal convolutions. Our analysis reveals a spectrum of critical bottlenecks: quadratic attention becomes severely memory-bound with catastrophic cache inefficiency, while sub-quadratic variants span from compute-bound on programmable vector cores to memory-bound by data movement. These findings provide essential insights for co-designing hardware-aware models and optimization strategies to enable efficient long-context inference on edge platforms.
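
The memory-bound versus compute-bound distinction drawn above is the one a roofline model makes: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The following is a minimal sketch of that classification, not the authors' model. The function names and the peak and bandwidth figures are hypothetical placeholders chosen for illustration, not measurements from the paper.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gb_s):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity).

    intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, bandwidth_gb_s * intensity)


def classify(intensity, peak_gflops, bandwidth_gb_s):
    """Label an operator by comparing its intensity to the ridge point."""
    ridge = peak_gflops / bandwidth_gb_s  # FLOPs/byte where the two roofs meet
    return "memory-bound" if intensity < ridge else "compute-bound"


# Hypothetical device figures, for illustration only (not from the paper).
PEAK_GFLOPS = 1000.0
BANDWIDTH_GB_S = 50.0

for name, intensity in [("low-intensity operator", 2.0),
                        ("high-intensity operator", 64.0)]:
    bound = attainable_gflops(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)
    label = classify(intensity, PEAK_GFLOPS, BANDWIDTH_GB_S)
    print(f"{name}: {bound:.0f} GFLOP/s attainable, {label}")
```

An operator whose intensity falls below the ridge point cannot reach peak compute regardless of how well it is scheduled, which is why poor cache reuse pushes an operator into the memory-bound regime.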

Paper Structure

This paper contains 30 sections, 19 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Differences in persistent memory and layer-wise dataflow for Attention-based Llama vs. SSM-based Mamba.
  • Figure 2: NPU dataflow architecture with processing elements (PEs) and accumulator hierarchy. Note the absence of high-bandwidth memory for persistent context storage.
  • Figure 3: Preserving causality across operator/context types.
  • Figure 4: Structured masked attention variants with differing causal matrices.
  • Figure 5: Primitive operation distribution across NPU hardware units for three representative causal operators.
  • ...and 3 more figures