Mapping Gemma3 onto an Edge Dataflow Architecture

Shouyu Du; Miaoxiang Yu; Zhiheng Ni; Jillian Cai; Qing Yang; Tao Wei; Zhenyu Xu

TL;DR

This paper addresses edge deployment of Gemma3 by mapping transformer-based LLM/VLM workloads onto a tiled AMD Ryzen AI NPU. It introduces hardware-aware techniques (FlowQKV, FlowKV, FusedDQP, and the compact Q4NX 4-bit quantization format) to optimize prefill and decoding, achieving up to $5.2\times$ faster prefill and $4.8\times$ faster decoding than the iGPU, with power-efficiency gains of up to $67.2\times$ and $222.9\times$ over the iGPU and CPU, respectively. The approach demonstrates end-to-end viability for real-time, low-power, on-device LLM/VLM inference, suitable for privacy-preserving applications, and provides a generalizable blueprint for mapping transformer workloads onto tiled dataflow accelerators.

Abstract

We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which restructures attention to sustain high memory bandwidth utilization. Together with a compact Q4NX 4-bit quantization format, these methods yield up to $5.2\times$ faster prefill and $4.8\times$ faster decoding versus the iGPU, and $33.5\times$ and $2.2\times$ over the CPU, respectively. Power efficiency improves by as much as $67.2\times$ and $222.9\times$ compared to the iGPU and CPU, respectively. The proposed approach demonstrates that modern NPUs can deliver practical, low-power LLM and VLM inference at the edge, and provides a generalizable blueprint for mapping transformer-based models onto tiled dataflow accelerators.

Paper Structure (29 sections, 8 equations, 12 figures, 5 tables)

Introduction
Background
AMD Ryzen AI NPU Architecture
Platform Overview
Compute Tiles (CTs)
Memory Tiles (MTs) and Shim Tiles (STs)
Programming
Data Movement on NPU
Model Architecture of Gemma3
Model Architecture
Projection Operation during Prefill--MM
Attention Computation during Prefill
Projection Operation during Decoding--MVM
Attention Computation during Decoding
Nonlinear functions
...and 14 more sections

Figures (12)

Figure 1: NPU architecture overview
Figure 2: CT overview
Figure 3: Data movement architecture on AMD NPU
Figure 4: Gemma3 4B model architecture (text portion; one transformer layer shown). Key model parameters: $D = 2560$ is the model dimension, $H = 8$ is the number of attention heads, $G = 4$ is the number of KV groups, and $d = 256$ is the head dimension, i.e. the dimensionality of each query ($q$), key ($k$), and value ($v$) vector per head. In prefill, $L_p$ is the sequence length (the token length of the current prompt); in decoding, $L_p = 1$. $L$ is the total sequence length, which grows progressively during autoregressive LLM inference. In local layers, the window length $L_w$ for sliding-window attention (SWA) is 1024. The attention module (both full and SWA) is highlighted inside the dotted red box.
Figure 5: Diagram of tiling-based MM. The concept of a supertile is specific to the AMD Ryzen AI NPU2 implementation and reflects the aggregated output block computed in $K/k$ load-compute cycles (2 in this example).
...and 7 more figures
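
The parameters quoted in the Figure 4 and Figure 5 captions imply some simple shape arithmetic, sketched below in Python. This is an illustrative sketch only, not the paper's implementation: the names `AttnConfig`, `projection_shapes`, `cached_tokens`, `kv_cache_elements`, and `load_compute_cycles` are hypothetical, the assumption that a local (SWA) layer only needs the most recent $L_w$ tokens of K/V is ours, and the values `K = 128`, `k = 64` are made up to reproduce the "2 cycles" of the Figure 5 example (the captions do not give them).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Gemma3 4B attention parameters as quoted in the Figure 4 caption."""
    D: int = 2560    # model dimension
    H: int = 8       # number of attention (query) heads
    G: int = 4       # number of KV groups
    d: int = 256     # head dimension
    L_w: int = 1024  # sliding-window length for local (SWA) layers


def projection_shapes(cfg: AttnConfig, L_p: int) -> dict:
    """Output shapes of the Q, K, V projections for L_p input tokens.

    L_p is the prompt length in prefill and 1 in decoding.
    """
    return {
        "q": (L_p, cfg.H * cfg.d),
        "k": (L_p, cfg.G * cfg.d),
        "v": (L_p, cfg.G * cfg.d),
    }


def cached_tokens(cfg: AttnConfig, L: int, local: bool) -> int:
    """Tokens whose K/V a layer attends to at total sequence length L.

    Assumption: a local layer only needs the last L_w tokens.
    """
    return min(L, cfg.L_w) if local else L


def kv_cache_elements(cfg: AttnConfig, L: int, local: bool) -> int:
    """K and V elements needed for one layer (the factor 2 covers K plus V)."""
    return 2 * cached_tokens(cfg, L, local) * cfg.G * cfg.d


def load_compute_cycles(K: int, k: int) -> int:
    """Number of K/k load-compute cycles per supertile (Figure 5)."""
    if K % k != 0:
        raise ValueError("this sketch assumes k divides K")
    return K // k


cfg = AttnConfig()
queries_per_kv_group = cfg.H // cfg.G          # query heads sharing one K/V head
decode_shapes = projection_shapes(cfg, L_p=1)  # decoding processes one token
full_layer = kv_cache_elements(cfg, L=4096, local=False)
local_layer = kv_cache_elements(cfg, L=4096, local=True)
cycles = load_compute_cycles(K=128, k=64)      # hypothetical K and k
```

Note that $H \cdot d = 2048$ differs from $D = 2560$, so the query projection is not square, and with $G = 4$ the key and value projections are half as wide as the query projection. Under the windowing assumption above, a local layer's K/V footprint stops growing once $L$ exceeds $L_w$, while a full-attention layer's footprint keeps growing linearly with $L$.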
