Mapping Gemma3 onto an Edge Dataflow Architecture
Shouyu Du, Miaoxiang Yu, Zhiheng Ni, Jillian Cai, Qing Yang, Tao Wei, Zhenyu Xu
TL;DR
This paper addresses edge deployment of Gemma3 by mapping transformer-based LLM/VLM workloads onto a tiled AMD Ryzen AI NPU. It introduces four hardware-aware techniques (FlowQKV, FlowKV, FusedDQP, and the compact Q4NX 4-bit quantization format) to optimize prefill and decoding. The reported gains are up to $5.2\times$ faster prefill and $4.8\times$ faster decoding than the iGPU, with power-efficiency improvements of up to $67.2\times$ over the iGPU and $222.9\times$ over the CPU. The approach demonstrates end-to-end viability for real-time edge inference and provides a generalizable blueprint for mapping transformer workloads onto tiled dataflow accelerators. In practice, this offers a scalable path to low-power, low-latency on-device LLM/VLM inference, suited to privacy-preserving applications.
Abstract
We present the first end-to-end deployment of the Gemma3 family of large language and vision models on a tiled edge dataflow architecture (AMD Ryzen AI NPU). Our work introduces a set of hardware-aware techniques. For prefill, we introduce an efficient dequantization engine, optimize tiled matrix multiplication kernels, and propose FlowQKV, a chunked, pipelined attention mechanism. For decoding, we introduce FusedDQP, which fuses dequantization and projection into a single kernel, and FlowKV, which restructures attention to sustain high memory bandwidth utilization. Together with a compact Q4NX 4-bit quantization format, these methods yield up to $5.2\times$ faster prefill and $4.8\times$ faster decoding versus the iGPU, and $33.5\times$ and $2.2\times$ over the CPU, respectively. Power efficiency improves by as much as $67.2\times$ and $222.9\times$ compared to the iGPU and CPU, respectively. The proposed approach demonstrates that modern NPUs can deliver practical, low-power LLM and VLM inference at the edge, and provides a generalizable blueprint for mapping transformer-based models onto tiled dataflow accelerators.
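To make the idea of fusing dequantization with projection concrete, the following is a minimal sketch in plain Python. It is not the FusedDQP kernel or the Q4NX format, whose details are not given in this section. It assumes a generic group-wise 4-bit scheme with one scale and one minimum per group, and the names `GROUP`, `quantize_row`, `dequantize_row`, and `fused_dot` are hypothetical. The point it illustrates is that a dot product can be accumulated group by group straight from the 4-bit codes, so the full-precision weight row never needs to be written back to memory.

```python
import random

GROUP = 32  # hypothetical group size; the actual Q4NX layout is not specified here


def quantize_row(row, group=GROUP):
    """Quantize one weight row to 4-bit codes (0..15) with a per-group scale and minimum."""
    codes, scales, mins = [], [], []
    for start in range(0, len(row), group):
        block = row[start:start + group]
        lo, hi = min(block), max(block)
        scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant block
        codes.extend(min(15, max(0, round((w - lo) / scale))) for w in block)
        scales.append(scale)
        mins.append(lo)
    return codes, scales, mins


def dequantize_row(codes, scales, mins, group=GROUP):
    """Reference path: materialize the whole dequantized row before using it."""
    return [codes[i] * scales[i // group] + mins[i // group] for i in range(len(codes))]


def fused_dot(codes, scales, mins, x, group=GROUP):
    """Fused path: dequantize and accumulate per group without storing the row."""
    total = 0.0
    for g, (scale, lo) in enumerate(zip(scales, mins)):
        start = g * group
        qs = codes[start:start + group]
        xs = x[start:start + group]
        # sum_i (q_i * scale + lo) * x_i == scale * sum(q_i * x_i) + lo * sum(x_i)
        total += scale * sum(q * v for q, v in zip(qs, xs)) + lo * sum(xs)
    return total


# Toy projection: 4 output rows, 64 input features (codes kept unpacked for clarity).
rng = random.Random(0)
weights = [[rng.uniform(-1, 1) for _ in range(64)] for _ in range(4)]
x = [rng.uniform(-1, 1) for _ in range(64)]

packed = [quantize_row(row) for row in weights]
y_fused = [fused_dot(c, s, m, x) for c, s, m in packed]
y_ref = [sum(w * v for w, v in zip(dequantize_row(c, s, m), x)) for c, s, m in packed]
```

Here `y_fused` and `y_ref` agree up to floating-point rounding, while the fused path reads only the 4-bit codes plus two values per group. A real kernel would additionally pack two codes per byte and tile the work across compute units; those aspects are omitted from this sketch.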
