From Buffers to Registers: Unlocking Fine-Grained FlashAttention with Hybrid-Bonded 3D NPU Co-Design

Jinxin Yu; Yudong Pan; Mengdi Wang; Huawei Li; Yinhe Han; Xiaowei Li; Ying Wang

TL;DR

This work addresses the energy bottleneck in Transformer inference by moving from 2D to hybrid-bonded 3D architectures. By vertically stacking processing elements and enabling register-to-register dataflow, the proposed 3D-Flow substrate supports fine-grained, bubble-free execution of FlashAttention across four coupled tiers, eliminating costly SRAM round-trips. The accompanying 3D-FlashAttention dataflow, balanced across tiers, delivers energy savings of $46$–$93\%$ and speedups of $1.4$–$7.6\times$ on OPT and Qwen workloads relative to state-of-the-art 2D and 3D baselines. This co-design approach reduces both off-chip and on-chip traffic, enabling more scalable and energy-efficient Transformer inference, and may also apply to other fused operators.

Abstract

Transformer-based models dominate modern AI workloads but exacerbate memory bottlenecks due to their quadratic attention complexity and ever-growing model sizes. Existing accelerators, such as Groq and Cerebras, mitigate off-chip traffic with large on-chip caches, while algorithmic innovations such as FlashAttention fuse operators to avoid materializing large attention matrices. However, as off-chip traffic decreases, our measurements show that on-chip SRAM accesses account for over 60% of energy in long-sequence workloads, making cache access the new bottleneck. We propose 3D-Flow, a hybrid-bonded, 3D-stacked spatial accelerator that enables register-to-register communication across vertically partitioned PE tiers. Unlike 2D multi-array architectures, which are limited by NoC-based router-to-router transfers, 3D-Flow leverages sub-10 µm vertical TSVs to sustain cycle-level operator pipelining with minimal overhead. On top of this architecture, we design 3D-FlashAttention, a fine-grained scheduling method that balances latency across tiers, forming a bubble-free vertical dataflow without on-chip SRAM round-trips. Evaluations on Transformer workloads (OPT and Qwen models) show that our 3D spatial accelerator reduces energy consumption by 46-93% and achieves 1.4x-7.6x speedups compared to state-of-the-art 2D and 3D designs.
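To make the fused-operator idea concrete, the following is a minimal sketch of the online-softmax recurrence that FlashAttention-style kernels use to avoid materializing the full attention matrix. It is plain Python for illustration only, not the paper's hardware dataflow or its scheduling algorithm. The function names, the `block` parameter, and the $1/\sqrt{d}$ scaling are assumptions introduced here. The stage comments follow the section titles listed on this page for layers 0 to 2; placing the final accumulation in a fourth stage is likewise an assumption.

```python
import math


def naive_attention(Q, K, V):
    """Reference version: builds the full score row for every query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        denom = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / denom
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block=2):
    """Online-softmax attention that visits K/V one block at a time.

    Only a running maximum, a running denominator, and a running
    output vector are kept per query row (hypothetical helper).
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    d_v = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running row maximum
        l = 0.0             # running softmax denominator
        acc = [0.0] * d_v   # running unnormalized output
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            # Stage 0: partial QK^T for this block.
            s = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            # Stage 1: row-wise maximum and subtraction.
            m_new = max(m, max(s))
            shifted = [x - m_new for x in s]
            # Stage 2: exponentials, plus the factor that rescales
            # results accumulated under the previous maximum.
            p = [math.exp(x) for x in shifted]
            correction = math.exp(m - m_new)
            l = l * correction + sum(p)
            # Assumed final stage: weighted accumulation with V.
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Calling `tiled_attention(Q, K, V, block=b)` should agree with `naive_attention(Q, K, V)` up to floating-point rounding for any block size `b >= 1`, since the correction factor only rescales previously accumulated terms to the new maximum.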

Paper Structure (23 sections, 13 figures, 2 tables, 1 algorithm)

Introduction
Background and Motivation
FlashAttention: From GPU to Systolic Array
Opportunities Enabled by 3D Integration
Motivation
3D-Flow Architecture
Overview
PE Design
Thermal Feasibility
3D-FlashAttention Dataflow
Overview
Layer-wise Dataflow
Layer 0: $\mathbf{QK^{T}}$
Layer 1: Row-wise Maximum and Subtraction
Layer 2: Exponential-related Operations
...and 8 more sections

Figures (13)

Figure 1: Energy breakdown with and without operator fusion at different sequence lengths for OPT.
Figure 2: Overview of 3D-stacked PE array architecture and the operator mapping of each layer.
Figure 3: PE in layer_0.
Figure 4: PE in layer_1.
Figure 5: PE in layer_2.
...and 8 more figures
