
Mapping Gemma3 onto an Edge Dataflow Architecture

Shouyu Du, Miaoxiang Yu, Zhiheng Ni, Jillian Cai, Qing Yang, Tao Wei, Zhenyu Xu

TL;DR

This paper addresses edge deployment of Gemma3 by mapping transformer-based LLM/VLM workloads onto a tiled AMD Ryzen AI NPU. It introduces hardware-aware techniques—FlowQKV, FlowKV, FusedDQP, and the compact Q4NX quantization—to optimize prefill and decoding, achieving substantial speedups and orders-of-magnitude energy efficiency improvements over CPU and iGPU baselines. The approach demonstrates end-to-end viability for real-time edge inference and provides a generalizable blueprint for translating transformer workloads onto tiled dataflow accelerators. The practical impact is a scalable pathway to on-device LLM/VLM inference with low power and latency, suitable for privacy-preserving applications and edge deployments.

Abstract

We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which re-structures attention to sustain high memory bandwidth utilization. Together with a compact Q4NX 4-bit quantization format, these methods yield up to $5.2\times$ faster prefill and $4.8\times$ faster decoding versus the iGPU, and $33.5\times$ and $2.2\times$ over the CPU, respectively. Power efficiency improves by as much as $67.2\times$ and $222.9\times$ compared to the iGPU and CPU. The proposed approach demonstrates that modern NPUs can deliver practical, low-power LLM and VLM inference at the edge, and provides a generalizable blueprint for mapping transformer-based models onto tiled dataflow accelerators.

Paper Structure (29 sections, 8 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: NPU architecture overview
  • Figure 2: CT overview
  • Figure 3: Data movement architecture on AMD NPU
  • Figure 4: Gemma3 4B model architecture (text portion; one transformer layer shown). Key model parameters: $D = 2560$ is the model dimension, $H = 8$ is the number of attention heads, $G = 4$ is the number of KV groups, and $d = 256$ is the head dimension, i.e., the dimensionality of each query ($q$), key ($k$), and value ($v$) vector per head. In prefill, $L_p$ is the token length of the current prompt; in decoding, $L_p = 1$. $L$ is the total sequence length, which grows progressively during autoregressive LLM inference. In local layers, the window length for sliding-window attention (SWA), $L_w$, is 1024. The attention module (both full and SWA) is highlighted inside the dotted red box.
  • Figure 5: Diagram of tiling-based MM. The concept of a supertile is specific to the AMD Ryzen AI NPU2 implementation and reflects the aggregated output block computed in $K/k$ load-compute cycles (2 in this example).
  • ...and 7 more figures
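
The parameters quoted in the Figure 4 and Figure 5 captions can be turned into a small shape calculation. The following is a minimal sketch, not the authors' implementation: the names `AttnConfig`, `projection_shapes`, `attended_length`, `load_compute_cycles`, and `tiled_matmul` are hypothetical. It assumes the usual grouped-query layout (Q has $H \cdot d$ columns, K and V have $G \cdot d$), that an SWA layer attends to at most $L_w$ cached positions, and that $k$ divides $K$ evenly. The matrix sizes in the usage note are illustrative only and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Values quoted in the Figure 4 caption (Gemma3 4B, text portion)."""
    D: int = 2560    # model dimension
    H: int = 8       # number of attention (query) heads
    G: int = 4       # number of KV groups
    d: int = 256     # head dimension
    L_w: int = 1024  # sliding-window length for local (SWA) layers


def projection_shapes(cfg, L_p):
    """Activation shapes after the Q/K/V projections for L_p input tokens.

    Assumption: standard grouped-query attention, so K and V are G*d wide.
    """
    return {
        "Q": (L_p, cfg.H * cfg.d),
        "K": (L_p, cfg.G * cfg.d),
        "V": (L_p, cfg.G * cfg.d),
    }


def attended_length(cfg, L, local):
    """Cached positions one query attends to at total sequence length L.

    Assumption: a local (SWA) layer sees at most L_w positions,
    while a full-attention layer sees all L.
    """
    return min(L, cfg.L_w) if local else L


def load_compute_cycles(K, k):
    """Figure 5: an output block is accumulated over K/k load-compute cycles."""
    if K % k != 0:
        raise ValueError("this sketch assumes k divides K")
    return K // k


def tiled_matmul(A, B, k):
    """Pure-Python C = A @ B, adding one k-wide slice of the reduction
    dimension per cycle (A is M x K, B is K x N, as lists of lists)."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for cycle in range(load_compute_cycles(K, k)):
        lo = cycle * k  # first reduction index loaded in this cycle
        for i in range(M):
            for j in range(N):
                C[i][j] += sum(A[i][lo + t] * B[lo + t][j] for t in range(k))
    return C
```

With the default configuration, `projection_shapes(AttnConfig(), 1)` gives a Q row of width 2048 and K/V rows of width 1024 for a single decoding step, and each KV group is shared by `H // G = 2` query heads. `load_compute_cycles(8, 4)` returns 2, matching the two-cycle case the Figure 5 caption mentions, although the actual $K$ and $k$ used there are not stated in this summary.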