DFModel: Design Space Optimization of Large-Scale Systems Exploiting Dataflow Mappings

Sho Ko, Nathan Zhang, Olivia Hsu, Ardavan Pedram, Kunle Olukotun

TL;DR

DFModel introduces a principled, solver-based framework for mapping dataflow graphs onto large-scale systems by jointly optimizing inter-chip and intra-chip mappings. It represents workloads as dataflow graphs and system specifications as constraints, then solves a two-pass mixed-integer program with the Gurobi optimizer to explore a design space of up to $O(10^{295})$ points. Across LLM and HPC workloads, DFModel achieves favorable accuracy against prior models and measured performance, with dataflow mappings delivering notable gains over traditional kernel-by-kernel approaches in training and serving scenarios. The approach enables systematic design-space exploration over accelerator types, memory technologies, interconnects, and topologies, offering practical guidance for building future large-scale AI/HPC systems.

Abstract

We propose DFModel, a modeling framework for mapping dataflow computation graphs onto large-scale systems. Mapping a workload to a system requires optimizing dataflow mappings at various levels, including the inter-chip (between chips) level and the intra-chip (within a chip) level. DFModel is, to the best of our knowledge, the first framework to perform the optimization at multiple levels of the memory hierarchy and the interconnection network hierarchy. We use DFModel to explore a wide range of workloads on a variety of systems. Evaluated workloads include two state-of-the-art machine learning applications (Large Language Models and Deep Learning Recommendation Models) and two high-performance computing applications (High Performance LINPACK and Fast Fourier Transform). System parameters investigated span the combination of dataflow and traditional accelerator architectures, memory technologies (DDR, HBM), interconnect technologies (PCIe, NVLink), and interconnection network topologies (torus, DGX, dragonfly). For a variety of workloads on a wide range of systems, DFModel provides mappings whose predicted performance is, on average, 1.25X better than the performance measured on real systems. DFModel shows that for large language model training, dataflow architectures achieve 1.52X higher performance, 1.59X better cost efficiency, and 1.6X better power efficiency compared to non-dataflow architectures. On an industrial system with dataflow architectures, the DFModel-optimized dataflow mapping achieves a speedup of 6.13X compared to non-dataflow mappings from previous performance models such as Calculon, and 1.52X compared to a vendor-provided dataflow mapping.
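
To give a feel for why the inter-chip level alone already produces a large search space, the following minimal sketch enumerates the ways to split a chip count among tensor, pipeline, and data parallelism (the three strategies named in the caption of Figure 2). It is an illustration only: the function name `parallelism_choices` is hypothetical, the enumeration is brute force rather than the mixed-integer formulation DFModel solves with Gurobi, and it ignores kernel sharding, topology, and every intra-chip choice.

```python
from itertools import product


def parallelism_choices(num_chips):
    """Return every (tp, pp, dp) triple whose product equals num_chips.

    tp, pp, dp are assumed degrees of tensor, pipeline, and data
    parallelism; this is a toy enumeration, not DFModel's solver.
    """
    divisors = [d for d in range(1, num_chips + 1) if num_chips % d == 0]
    return [
        (tp, pp, dp)
        for tp, pp, dp in product(divisors, repeat=3)
        if tp * pp * dp == num_chips
    ]


# Eight chips, matching the eight-chip example in Figure 2.
choices = parallelism_choices(8)
for tp, pp, dp in choices:
    print(f"tensor={tp} pipeline={pp} data={dp}")
```

Even this three-way split grows quickly with the chip count, and each choice multiplies with the per-kernel and intra-chip decisions, which is why a solver-based search is attractive compared with exhaustive enumeration.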

Paper Structure

This paper contains 44 sections, 7 equations, 22 figures, 6 tables.

Figures (22)

  • Figure 1: Overview of DFModel. DFModel takes in a workload description, represented as a dataflow graph, and a system specification covering both the multi-node distributed system and the individual data-parallel chip. DFModel goes through two layers of optimization: an inter-chip layer (1) for hierarchical system-level optimization and an intra-chip layer (3) for hierarchical chip-level optimization. (1) takes in the workload and the hierarchical system-level specification and produces the inter-chip mapping and metrics (2). Then (2) and the hierarchical chip-level specification are fed into (3) to produce the intra-chip mapping and metrics (4). We assume a typical chip lies close to the Pareto-optimal balance between memory and computation, similar to existing accelerators such as GPUs [choquette2022nvidia], TPUs [jouppi2023tpu], and SambaNova RDUs [prabhakar2024sambanova].
  • Figure 2: (A) The workload dataflow graph of a single-layer generative pre-training transformer (GPT) model. (B) Inter-chip dataflow mapping: parallelization strategies such as tensor parallelism, pipeline parallelism, and data parallelism are used to map a workload onto an eight-chip system. (C) Intra-chip dataflow mapping: multiple kernels are fused on-chip and data is pipelined through the kernels in a streaming fashion. (D) Intra-chip kernel-by-kernel mapping: kernels are executed sequentially with frequent DRAM accesses between kernels.
  • Figure 3: Four assignment matrices used in DFModel. Matrix $\mathbf{A}$ encodes the kernel-to-partition assignment, which is useful for deriving other assignment matrices. Matrix $\mathbf{B}$ encodes the tensors which stay within a partition. Matrix $\mathbf{D}$ encodes the tensors which cross two different partitions. Matrix $\mathbf{L}$ encodes the lifetime of cross-partition tensors. Matrix $\mathbf{H}$ is not shown due to space constraints.
  • Figure 4: Kernel sharding results in two types of communication cost: communication inherent to the kernel in (A) and tensor layout conversion in (B). Using matrix multiplication as an example, two sharding strategies in (A) shard the tensors in the kernel along different dimensions and incur different communication. For each tensor, different tensor layout conversions in (B) incur different communication.
  • Figure 5: DFModel takes in a workload and a system as inputs. The workload is a dataflow graph and the system is a multi-node distributed system composed of several layers of hierarchical memories, interconnection networks, and compute nodes/accelerators. Each compute node/accelerator is a data-parallel component with on-chip compute units, hierarchical memories, and a main memory. DFModel then applies two optimization layers, the inter-chip layer (1) and the intra-chip layer (3), to find the best dataflow mappings. In (1), DFModel optimizes at the hierarchical distributed system level. To do so, DFModel takes the dataflow graph description and the multi-node distributed system specification as inputs and combines them with internal assignment variables. All the variables are then fed into multiple performance equations modeling various aspects of a distributed system. The equations are encoded as constraints and objectives in Gurobi, which iteratively updates the variables until the objective is reached. Then the inter-chip mapping and its associated variables (2) are fed to the intra-chip optimization level (3). In (3), the inputs are combined with the specification of a data-parallel chip, and (3) iteratively solves the problem in the same way as the inter-chip level. Finally, the intra-chip level produces the final dataflow mapping (4).
  • ...and 17 more figures
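
The assignment matrices described in the caption of Figure 3 can be illustrated on a toy graph. The sketch below is an assumed interpretation, not the paper's exact formulation: it treats $\mathbf{A}$ as a one-hot kernel-by-partition matrix, marks a tensor in $\mathbf{B}$ when its producer and consumer share a partition, marks it in $\mathbf{D}$ for each of the two partitions it connects when they differ, and marks it in $\mathbf{L}$ for every partition from producer to consumer, assuming partitions execute in index order. The graph, the helper names `one_hot` and `derive_matrices`, and these encodings are all hypothetical; in DFModel the entries of $\mathbf{A}$ are solver variables rather than fixed inputs.

```python
# Toy dataflow graph: 4 kernels; each tensor is a (producer, consumer) edge.
tensors = [(0, 1), (1, 2), (0, 3), (2, 3)]
partition_of = [0, 0, 1, 2]  # assumed kernel -> partition assignment
num_parts = 3


def one_hot(index, width):
    """Row with a single 1 at position `index`."""
    return [1 if j == index else 0 for j in range(width)]


# A[k][p] == 1 iff kernel k is assigned to partition p.
A = [one_hot(p, num_parts) for p in partition_of]


def derive_matrices(A, tensors):
    """Derive B (intra-partition), D (cross-partition), L (lifetime) from A."""
    n = len(A[0])
    B, D, L = [], [], []
    for src, dst in tensors:
        # Tensor stays inside p when both endpoints are assigned to p.
        B.append([A[src][p] * A[dst][p] for p in range(n)])
        # Tensor crosses p's boundary when exactly one endpoint is in p.
        D.append([A[src][p] ^ A[dst][p] for p in range(n)])
        # A cross-partition tensor is live from its first to its last partition.
        lo, hi = sorted((A[src].index(1), A[dst].index(1)))
        L.append([1 if lo != hi and lo <= p <= hi else 0 for p in range(n)])
    return B, D, L


B, D, L = derive_matrices(A, tensors)
for name, matrix in (("B", B), ("D", D), ("L", L)):
    print(name, matrix)
```

Under these assumptions the tensor from kernel 0 to kernel 3 touches only partitions 0 and 2 in `D` but stays live across all three partitions in `L`, which is the kind of buffering cost a lifetime matrix lets a solver account for.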