Streaming Tensor Program: A streaming abstraction for dynamic parallelism

Gina Sohn; Genghan Zhang; Konstantin Hossfeld; Jungwoo Kim; Nathan Sobotka; Nathan Zhang; Olivia Hsu; Kunle Olukotun

TL;DR

STeP introduces a streaming abstraction for dynamic tensor workloads on spatial dataflow accelerators (SDAs), addressing the limitations of prior SDA programming abstractions in handling data-dependent control flow, ragged shapes, and an explicit memory hierarchy. By representing data as streams with dynamic tiles, stop-token–based shape semantics, and a rich set of operators (including dynamic routing, merging, and higher-order constructs), STeP enables optimizations such as dynamic tiling, configuration time-multiplexing, and dynamic parallelization while maintaining dataflow efficiency. A symbolic frontend paired with a cycle-approximate simulator quantifies off-chip traffic and on-chip memory use and is validated against cycle-accurate HDL. On representative LLM layers, the reported improvements are a 2.18x reduction in on-chip memory with dynamic tiling, 2.57x better compute utilization with configuration time-multiplexing, and a 1.5x latency improvement with dynamic parallelization. The work highlights how an explicit memory hierarchy and shape-aware analysis unlock new schedules for dynamic tensor workloads, and it provides a foundation for hardware designs that support richer dynamism in SDAs.

Abstract

Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, input tensors are often dynamically shaped or ragged, and many models rely on data-dependent control flow. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators either forces dynamic behaviors to be implemented statically or fails to provide the visibility needed for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations (dynamic tiling, dynamic parallelization, and configuration time-multiplexing) that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces the on-chip memory requirement by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.
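
To make the idea of stop-token-based shape semantics more concrete, the following is a minimal sketch of how a ragged 2-D tensor might be carried on a stream whose shape is only known at run time. It is an illustration under stated assumptions, not STeP's actual API: the `Stop` class, its `level` numbering, and the `encode`/`decode` helpers are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Stop:
    """Hypothetical stop token: closes `level` nested dimensions at once."""
    level: int


Token = Union[float, Stop]


def encode(ragged: List[List[float]]) -> List[Token]:
    """Flatten a ragged 2-D tensor into a stream of values and stop tokens.

    Assumed convention: Stop(1) ends a row; Stop(2) ends the final row
    and the whole tensor in a single token.
    """
    stream: List[Token] = []
    for i, row in enumerate(ragged):
        stream.extend(row)
        is_last_row = i == len(ragged) - 1
        stream.append(Stop(2 if is_last_row else 1))
    return stream


def decode(stream: List[Token]) -> List[List[float]]:
    """Rebuild the ragged rows; row lengths are recovered from the stream."""
    rows: List[List[float]] = []
    current: List[float] = []
    for tok in stream:
        if isinstance(tok, Stop):
            rows.append(current)  # any stop token closes the current row
            current = []
        else:
            current.append(tok)
    return rows


ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
stream = encode(ragged)
row_lengths = [len(r) for r in decode(stream)]
```

In this sketch the consumer never needs the row lengths ahead of time; they fall out of where the stop tokens appear, which is the property that lets a streaming operator handle data-dependent shapes.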
