Warp-STAR: High-performance, Differentiable GPU-Accelerated Static Timing Analysis through Warp-oriented Parallel Orchestration

En-Ming Huang, Shih-Hao Hung

Abstract

Static timing analysis (STA) is crucial for Electronic Design Automation (EDA) flows but remains a computational bottleneck. While existing GPU-based STA engines are faster than their CPU counterparts, they suffer from inefficiencies, particularly intra-warp load imbalance caused by irregular circuit graphs. This paper introduces Warp-STAR, a novel GPU-accelerated STA engine that eliminates this imbalance by orchestrating parallel computations at the warp level. This approach achieves a 2.4X speedup over the previous state-of-the-art (SoTA) GPU-based STA engine. When integrated into a timing-driven global placement framework, Warp-STAR delivers a 1.7X speedup over SoTA frameworks. The method also extends to differentiable gradient computation with minimal overhead.
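
The general idea behind warp-oriented orchestration can be pictured with a warp-per-net arrival-time (AT) update: all 32 lanes of a warp stride over one net's fanout pins together, so a single large net no longer stalls one thread while its warp-mates idle. The sketch below only illustrates that idea under an assumed CSR-style data layout (net_pin_offsets, net_driver, fanout_pins, arc_delay, pin_arrival are hypothetical names); it is not Warp-STAR's actual kernel.

```cuda
// Hedged sketch: warp-per-net arrival-time propagation.
// The data layout and names are assumptions, not Warp-STAR's actual format.
__global__ void propagate_at_warp_per_net(
    const int*   __restrict__ net_pin_offsets,  // CSR offsets into fanout_pins
    const int*   __restrict__ net_driver,       // driver pin index of each net
    const int*   __restrict__ fanout_pins,      // fanout pin indices, grouped per net
    const float* __restrict__ arc_delay,        // delay of each driver->fanout arc
    float*       __restrict__ pin_arrival,      // arrival time per pin
    int num_nets)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane    = threadIdx.x % 32;
    if (warp_id >= num_nets) return;

    const float driver_at = pin_arrival[net_driver[warp_id]];
    const int begin = net_pin_offsets[warp_id];
    const int end   = net_pin_offsets[warp_id + 1];

    // All lanes of the warp walk this net's fanout pins in lockstep strides,
    // instead of one thread looping over the whole net on its own.
    for (int p = begin + lane; p < end; p += 32) {
        pin_arrival[fanout_pins[p]] = driver_at + arc_delay[p];
    }
}
```

A real STA engine would additionally take the max over multiple driving arcs and track rise/fall and early/late values; the point here is only the warp-level work assignment.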

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 4 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Circuit components and levelization in STA. (a) The components of a circuit design. (b) The levelization process.
  • Figure 2: Illustration of thread divergence in a 4-threaded warp.
  • Figure 3: Comparison of task scheduling strategies for threads T0 to T2. (a) Net-based scheme: each thread processes an entire net. (b) Pin-based scheme: each thread processes individual pins, allowing for higher parallelism. (c) Collaborative Task Engagement (CTE), which dynamically reschedules workloads within the thread block to maximize utilization.
  • Figure 4: Overlapped execution of AT propagation and gradient computation in Warp-STAR. CUDA events are employed, allowing concurrent execution while ensuring correctness (a minimal sketch of this event-ordering pattern follows the figure list).
  • Figure 5: Runtime breakdown of the aes_cipher_top test case.
  • ...and 3 more figures
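
To make the overlap described in Figure 4 concrete, the fragment below shows the standard CUDA stream-and-event pattern for letting a downstream stage wait on an upstream one without serializing the whole device. The kernel names (at_propagation_kernel, gradient_kernel) are placeholders, not Warp-STAR's actual entry points, and the launch configuration is arbitrary.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for AT propagation and gradient computation.
__global__ void at_propagation_kernel(float* at) { /* ... */ }
__global__ void gradient_kernel(const float* at, float* grad) { /* ... */ }

void run_overlapped(float* d_at, float* d_grad) {
    cudaStream_t at_stream, grad_stream;
    cudaEvent_t  at_done;
    cudaStreamCreate(&at_stream);
    cudaStreamCreate(&grad_stream);
    cudaEventCreateWithFlags(&at_done, cudaEventDisableTiming);

    // Stage 1: arrival-time propagation runs on its own stream.
    at_propagation_kernel<<<256, 128, 0, at_stream>>>(d_at);
    cudaEventRecord(at_done, at_stream);

    // Stage 2: the gradient stream waits only on the event, so independent
    // work already queued on grad_stream can still overlap with stage 1.
    cudaStreamWaitEvent(grad_stream, at_done, 0);
    gradient_kernel<<<256, 128, 0, grad_stream>>>(d_at, d_grad);

    cudaStreamSynchronize(grad_stream);
    cudaEventDestroy(at_done);
    cudaStreamDestroy(at_stream);
    cudaStreamDestroy(grad_stream);
}
```

Ordering through an event rather than a global synchronization is what allows independent work on the two streams to overlap while the gradient stage still observes a fully propagated arrival-time array.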