Streaming Tensor Program: A streaming abstraction for dynamic parallelism

Gina Sohn, Genghan Zhang, Konstantin Hossfeld, Jungwoo Kim, Nathan Sobotka, Nathan Zhang, Olivia Hsu, Kunle Olukotun

TL;DR

STeP introduces a streaming abstraction for dynamic tensor workloads on spatial dataflow accelerators (SDAs). It addresses the limitations of prior SDA programming abstractions, which struggle to express data-dependent control flow and ragged shapes and do not expose the memory hierarchy explicitly. By representing data as streams with dynamic tiles, stop-token-based shape semantics, and a rich set of operators (including dynamic routing, merging, and higher-order constructs), STeP enables optimizations such as dynamic tiling, configuration time-multiplexing, and dynamic parallelization while maintaining dataflow efficiency. A symbolic frontend paired with a cycle-approximate simulator quantifies off-chip traffic and on-chip memory, and the simulator is validated against cycle-accurate HDL. On representative LLM layers, dynamic tiling reduces on-chip memory by 2.18x, configuration time-multiplexing improves compute utilization by 2.57x, and dynamic parallelization improves latency by 1.5x. The work highlights how an explicit memory hierarchy and shape-aware analysis unlock new schedules for dynamic tensor workloads, and it provides a foundation for hardware designs that support richer dynamism in SDAs.
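To make the stop-token idea concrete, the following is a minimal sketch of how a ragged rank-2 tensor could be flattened into a token stream and recovered again. Only the done token (D) is named in this summary; the `S1`/`S2` labels, the rule that the last row ends with `S2`, and the helper names `to_stream`, `from_stream`, and `row_lengths` are assumptions made for illustration and may differ from the paper's exact encoding.

```python
from typing import List, Union

Token = Union[int, str]

DONE = "D"  # done token: end of the stream (named in the paper's Figure 1)
S1 = "S1"   # assumed label: end of an inner row
S2 = "S2"   # assumed label: end of the outer dimension


def to_stream(rows: List[List[int]]) -> List[Token]:
    """Flatten a ragged 2-D tensor into a token stream (assumed encoding)."""
    stream: List[Token] = []
    for i, row in enumerate(rows):
        stream.extend(row)
        # The last row closes both dimensions, so emit only the higher stop token.
        stream.append(S2 if i == len(rows) - 1 else S1)
    stream.append(DONE)
    return stream


def from_stream(stream: List[Token]) -> List[List[int]]:
    """Rebuild the ragged rows by cutting the stream at every stop token."""
    rows: List[List[int]] = []
    current: List[int] = []
    for tok in stream:
        if tok == DONE:
            break
        if tok in (S1, S2):
            rows.append(current)
            current = []
        else:
            current.append(tok)
    return rows


def row_lengths(stream: List[Token]) -> List[int]:
    """The dynamic (ragged) inner dimension, read off the stop tokens."""
    return [len(row) for row in from_stream(stream)]
```

Because the shape travels in-band with the data, a consumer in this sketch never needs to know the row lengths ahead of time, which is the property that lets a stream carry dynamically shaped or ragged tensors.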

Abstract

Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, input tensors are often dynamically shaped or ragged, and data-dependent control flow is widely used in many models. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators either forces dynamic behaviors to be implemented statically or hides the information needed for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations (dynamic tiling, dynamic parallelization, and configuration time-multiplexing) that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces the on-chip memory requirement by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.

Paper Structure

This paper contains 36 sections, 2 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: A tensor represented as a rank-2 STeP stream. The done token (D) denotes the end of the stream.
  • Figure 2: Example STeP graph for an MoE layer in Mixtral 8x7B. Each token is routed to 2 experts, and each expert computes $(\mathrm{SiLU}(xW_1) * (xW_3))W_2$. Each stream is annotated with its shape and the shape of the tiles it carries. Different colors represent different kinds of STeP operators. Some operators have arguments omitted for simplicity. Expand with n=[224] is syntactic sugar for static repetition.
  • Figure 3: An example of a Partition operator. $B_i$ in each output stream is a newly created dynamic regular dimension.
  • Figure 4: An example of a Reassemble operator. The multihot vector is expressed as tuples. Because the selector is multihot, $B_{sel}$ is a ragged dimension unless the selector is k-hot.
  • Figure 5: Cycle count and memory traffic comparison for a SwiGLU layer with different tile sizes. The full sizes of the batch dimension, hidden dimension, and MoE intermediate dimension are 64, 256, and 512, respectively.
  • ...and 8 more figures
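
The routing and expert computation described in the captions for Figures 2 to 4 can be sketched functionally in plain Python. This is an analogy over lists, not the STeP operators themselves: the function names `partition`, `reassemble`, `expert`, `matvec`, and `silu`, the list-of-lists matrix layout, and the assumption that each per-expert stream preserves token order are all hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Row vector times matrix: x (length d) times w (d x n) gives length n."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def silu(v: float) -> float:
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def expert(x: Vector, w1: Matrix, w3: Matrix, w2: Matrix) -> Vector:
    """One expert from the Figure 2 caption: (SiLU(x W1) * (x W3)) W2."""
    gate = [silu(v) for v in matvec(x, w1)]
    up = matvec(x, w3)
    return matvec([g * u for g, u in zip(gate, up)], w2)


def partition(tokens: Sequence[Vector],
              selectors: Sequence[Tuple[int, ...]],
              num_experts: int) -> Dict[int, List[Vector]]:
    """Send each token to every expert listed in its selector tuple.

    The length of each output list is only known at run time, mirroring the
    dynamic dimension B_i in the Figure 3 caption.
    """
    buckets: Dict[int, List[Vector]] = {e: [] for e in range(num_experts)}
    for tok, sel in zip(tokens, selectors):
        for e in sel:
            buckets[e].append(tok)
    return buckets


def reassemble(per_expert: Dict[int, list],
               selectors: Sequence[Tuple[int, ...]]) -> List[list]:
    """For each token, in original order, gather results from its selected experts.

    Assumes each per-expert list is in the order produced by `partition`.
    """
    cursors = {e: 0 for e in per_expert}
    gathered: List[list] = []
    for sel in selectors:
        group = []
        for e in sel:
            group.append(per_expert[e][cursors[e]])
            cursors[e] += 1
        gathered.append(group)
    return gathered
```

With top-2 routing every selector tuple has length 2, so every gathered group has exactly two entries; with a general multihot selector the group sizes vary per token, which is the ragged case noted in the Figure 4 caption.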