DG-RePlAce: A Dataflow-Driven GPU-Accelerated Analytical Global Placement Framework for Machine Learning Accelerators

Andrew B. Kahng, Zhiang Wang

TL;DR

DG-RePlAce addresses the scalability and QoR challenges of global placement for ML accelerators with a GPU-accelerated framework, built on OpenROAD, that exploits dataflow and datapath regularity. It integrates physical hierarchy extraction, dataflow-driven initial distribution, and datapath constraints into a parallel analytical placement flow, using virtual connections and pseudo nets to guide placement. On ML accelerator designs (Tabla/GeneSys) in a commercial 12nm enablement, it reduces routed wirelength by 10% (7%) and TNS by 31% (34%) on average compared with RePlAce (DREAMPlace), with faster global placement and on-par total runtime relative to DREAMPlace. Post-route gains on the TILOS MacroPlacement Benchmarks suggest the approach may extend beyond ML accelerators. Aligning the layout with the design's dataflow is what drives these gains on datapath-rich accelerators.
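
As a rough illustration of how pseudo nets can guide an analytical placer, the sketch below adds weighted two-pin pseudo nets to a half-perimeter wirelength (HPWL) objective. It is not DG-RePlAce's actual formulation: the function names, the cell names, and the weight of 0.5 are hypothetical, and the summary above does not say how the paper constructs or weights its virtual connections.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

def hpwl(pins: List[Point]) -> float:
    """Half-perimeter wirelength of the bounding box around a net's pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wirelength(
    pos: Dict[str, Point],
    nets: List[List[str]],
    pseudo_nets: List[Tuple[str, str, float]],
) -> float:
    """Real-net HPWL plus weighted HPWL of two-pin pseudo nets.

    Each pseudo net (a, b, w) is an artificial connection that pulls
    cells a and b together with strength w (w is an assumed parameter).
    """
    real = sum(hpwl([pos[c] for c in net]) for net in nets)
    pseudo = sum(w * hpwl([pos[a], pos[b]]) for a, b, w in pseudo_nets)
    return real + pseudo

# Hypothetical three-cell example: two PEs sharing a buffer.
positions = {"pe0": (0.0, 0.0), "pe1": (4.0, 0.0), "buf": (2.0, 3.0)}
nets = [["pe0", "buf"], ["pe1", "buf"]]
pseudo_nets = [("pe0", "pe1", 0.5)]  # assumed dataflow link between the PEs

cost = total_wirelength(positions, nets, pseudo_nets)
```

Because the pseudo-net term enters the same objective as the real nets, a placer minimizing `cost` is nudged to keep dataflow-related cells close even when no physical net connects them directly.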

Abstract

Global placement is a fundamental step in VLSI physical design. The wide use of 2D processing element (PE) arrays in machine learning accelerators poses new challenges of scalability and Quality of Results (QoR) for state-of-the-art academic global placers. In this work, we develop DG-RePlAce, a new and fast GPU-accelerated global placement framework built on top of the OpenROAD infrastructure, which exploits the inherent dataflow and datapath structures of machine learning accelerators. Experimental results with a variety of machine learning accelerators using a commercial 12nm enablement show that, compared with RePlAce (DREAMPlace), our approach achieves an average reduction in routed wirelength by 10% (7%) and total negative slack (TNS) by 31% (34%), with faster global placement and on-par total runtimes relative to DREAMPlace. Empirical studies on the TILOS MacroPlacement Benchmarks further demonstrate that post-route improvements over RePlAce and DREAMPlace may reach beyond the motivating application to machine learning accelerators.

Paper Structure

This paper contains 17 sections, 5 equations, 12 figures, 8 tables, 1 algorithm.

Figures (12)

  • Figure 1: Illustrative execution flow of a systolic array-based machine learning accelerator (figure reproduced from EsmaeilzadehGGGK21).
  • Figure 2: Overview of the proposed DG-RePlAce flow.
  • Figure 3: Dataflow visualization of the Tabla01 design (EsmaeilzadehGGGK21).
  • Figure 4: Illustration of the bloat-shrink approach for reducing $cluster\_overflow$. Left: density overflow caused by overlap between clusters A and B; Right: removal of overlap by shrinking clusters A and B.
  • Figure 5: Datapath constraints construction on the 2D PE array.
  • ...and 7 more figures