DFModel: Design Space Optimization of Large-Scale Systems Exploiting Dataflow Mappings
Sho Ko, Nathan Zhang, Olivia Hsu, Ardavan Pedram, Kunle Olukotun
TL;DR
DFModel introduces a principled, solver-based framework for mapping dataflow graphs onto large-scale systems by jointly optimizing inter-chip and intra-chip mappings. It represents workloads as dataflow graphs and system specifications as constraints, then solves a two-pass mixed-integer program with the Gurobi optimizer to explore a design space of up to $O(10^{295})$ points. Across LLM and HPC workloads, DFModel's predictions compare favorably in accuracy with prior models and measured performance, and dataflow mappings deliver notable gains over traditional kernel-by-kernel approaches in training and serving scenarios. The approach enables systematic design-space exploration over accelerator types, memory technologies, interconnects, and topologies, offering practical guidance for building future large-scale AI/HPC systems.
Abstract
We propose DFModel, a modeling framework for mapping dataflow computation graphs onto large-scale systems. Mapping a workload to a system requires optimizing dataflow mappings at various levels, including the inter-chip (between chips) level and the intra-chip (within a chip) level. DFModel is, to the best of our knowledge, the first framework to perform the optimization at multiple levels of the memory hierarchy and the interconnection network hierarchy. We use DFModel to explore a wide range of workloads on a variety of systems. Evaluated workloads include two state-of-the-art machine learning applications (Large Language Models and Deep Learning Recommendation Models) and two high-performance computing applications (High Performance LINPACK and Fast Fourier Transform). System parameters investigated span the combination of dataflow and traditional accelerator architectures, memory technologies (DDR, HBM), interconnect technologies (PCIe, NVLink), and interconnection network topologies (torus, DGX, dragonfly). Across these workloads and systems, the mappings that DFModel produces are predicted to perform, on average, 1.25X better than the performance measured on real systems. DFModel shows that for large language model training, dataflow architectures achieve 1.52X higher performance, 1.59X better cost efficiency, and 1.6X better power efficiency compared to non-dataflow architectures. On an industrial system with dataflow architectures, the DFModel-optimized dataflow mapping achieves a speedup of 6.13X compared to non-dataflow mappings from previous performance models such as Calculon, and 1.52X compared to a vendor-provided dataflow mapping.
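To make the idea of inter-chip mapping as an optimization problem concrete, the following minimal sketch partitions a toy linear dataflow graph into pipeline stages across chips and picks the partition with the smallest bottleneck stage time. It is not DFModel's formulation: it uses exhaustive search in place of a mixed-integer program, it ignores the intra-chip level entirely, and every name and number in it (`flops`, `edge_bytes`, `NUM_CHIPS`, the chip and link rates, the overlap assumption) is a hypothetical chosen for illustration.

```python
from itertools import combinations

# Hypothetical toy workload: a linear chain of five kernels.
# flops[i] is the work of kernel i; edge_bytes[i] is the tensor
# passed from kernel i to kernel i + 1.
flops = [4e12, 2e12, 6e12, 2e12, 2e12]
edge_bytes = [1e9, 4e9, 1e9, 2e9]

# Hypothetical system: identical chips connected as a pipeline.
NUM_CHIPS = 3
CHIP_FLOPS_PER_S = 1e12
LINK_BYTES_PER_S = 1e9


def stage_time(start, end):
    """Steady-state time of one stage holding kernels start..end-1."""
    compute = sum(flops[start:end]) / CHIP_FLOPS_PER_S
    # The incoming tensor crosses a chip boundary unless this is the first stage.
    comm = edge_bytes[start - 1] / LINK_BYTES_PER_S if start > 0 else 0.0
    # Assumption: compute and communication overlap, so the slower one dominates.
    return max(compute, comm)


def best_partition():
    """Enumerate every contiguous partition and return (bottleneck_time, cuts)."""
    n = len(flops)
    best = None
    for cuts in combinations(range(1, n), NUM_CHIPS - 1):
        bounds = (0, *cuts, n)
        bottleneck = max(
            stage_time(bounds[i], bounds[i + 1]) for i in range(NUM_CHIPS)
        )
        if best is None or bottleneck < best[0]:
            best = (bottleneck, cuts)
    return best


if __name__ == "__main__":
    time_s, cuts = best_partition()
    print(f"bottleneck stage time: {time_s} s, cut before kernels {cuts}")
```

This toy instance has only six candidate partitions, so enumeration is trivial. With a design space on the order of $10^{295}$ points, enumeration is infeasible, which is why a solver-based formulation is needed at realistic scale.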
