LightningSimV2: Faster and Scalable Simulation for High-Level Synthesis via Graph Compilation and Optimization

Rishov Sarkar, Rachel Paul, Cong Hao

TL;DR

LightningSimV2 shortens the slow feedback loop in HLS evaluation in two ways: it replaces repetitive trace generation with static analysis, and it splits stall calculation into a one-time graph construction step and a lightweight graph traversal step. The results match LightningSim's with no loss of accuracy, with speedups of up to 3.5x end-to-end, up to 6.4x for trace analysis, and up to 577x for incremental stall calculation, while dramatically reducing memory usage. Because only the traversal step is repeated, design space exploration (DSE) over hardware parameters such as FIFO depths scales to large design spaces and parallelizes well. The toolchain is open-source, and the work lays the groundwork for handling nondeterministic behavior in future extensions.

Abstract

High-Level Synthesis (HLS) enables rapid prototyping of complex hardware designs by translating C or C++ code to low-level RTL code. However, testing and evaluating HLS designs still typically relies on slow RTL-level simulators that can take hours to provide feedback, especially for complex designs. A recent work, LightningSim, helps to solve this problem by providing a simulation workflow one to two orders of magnitude faster than RTL simulation. However, it still suffers from several types of redundant computation, which make it slow for simulating large designs and for design space exploration (DSE). To address these inefficiencies, we introduce LightningSimV2, a much faster and more scalable simulation tool. LightningSimV2 features three main innovations. First, we perform compile-time static analysis, exploiting the repetitive structures in HLS designs, e.g., loops, to reduce the simulation workload. Second, we propose a novel graph-based simulation approach that decouples simulation graph construction from graph traversal, significantly reducing repeated computation. Third, benefiting from the decoupled approach, LightningSimV2 can perform incremental stall analysis extremely fast, enabling highly efficient DSE over large numbers of complex hardware parameters, e.g., finding optimal FIFO depths. Moreover, the DSE is well-suited to parallel computing, which further improves its efficiency. Compared with LightningSim, LightningSimV2 achieves up to 3.5x speedup in full simulation and up to 577x speedup for incremental DSE. Our code is open-source on GitHub at https://github.com/sharc-lab/LightningSim/tree/v0.2.0.
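
The split between one-time graph construction and repeated graph traversal can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not LightningSimV2's implementation: it assumes a single producer writing to a single FIFO read by a single consumer, a simplified FIFO model in which write i must wait for read i - depth, and hypothetical names (`SimGraph`, `build_graph`, `clock_count`) and timing parameters chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimGraph:
    """Toy simulation graph (hypothetical structure, for illustration only)."""
    num_tokens: int
    # fixed_edges[v] = [(u, delay), ...]: v cannot happen before time[u] + delay
    fixed_edges: dict = field(default_factory=dict)
    # earliest[v] = earliest cycle at which v may happen, independent of edges
    earliest: dict = field(default_factory=dict)


def build_graph(num_tokens, producer_ii=1, consumer_ii=1, consumer_start=5):
    """Step 1 (done once): build every edge that does not depend on FIFO depth."""
    g = SimGraph(num_tokens)
    g.earliest[("r", 0)] = consumer_start  # assumed consumer start-up latency
    for i in range(num_tokens):
        w, r = ("w", i), ("r", i)
        # Write i follows write i-1 by the producer's initiation interval.
        g.fixed_edges[w] = [(("w", i - 1), producer_ii)] if i > 0 else []
        # Read i needs its data: one cycle after write i.
        g.fixed_edges[r] = [(w, 1)]
        if i > 0:
            # Read i follows read i-1 by the consumer's initiation interval.
            g.fixed_edges[r].append((("r", i - 1), consumer_ii))
    return g


def clock_count(g, fifo_depth):
    """Step 2 (repeated per design point): longest-path traversal.

    Only the FIFO-full edges depend on fifo_depth, so the graph itself is
    never modified. Visiting w0, r0, w1, r1, ... is a valid topological
    order for any fifo_depth >= 1.
    """
    time = {}
    for i in range(g.num_tokens):
        for node in (("w", i), ("r", i)):
            t = g.earliest.get(node, 0)
            for pred, delay in g.fixed_edges[node]:
                t = max(t, time[pred] + delay)
            if node[0] == "w" and i >= fifo_depth:
                # FIFO full: write i waits until read i-depth frees a slot.
                t = max(t, time[("r", i - fifo_depth)] + 1)
            time[node] = t
    return time[("r", g.num_tokens - 1)] + 1  # cycles until the last read ends


graph = build_graph(num_tokens=4)                      # one-time construction
sweep = {d: clock_count(graph, d) for d in (1, 2, 3, 4)}  # cheap re-traversals
print(sweep)  # {1: 12, 2: 9, 3: 9, 4: 9}
```

In this toy sweep, a depth of 1 stalls the producer and costs 12 cycles, while any depth of 2 or more reaches 9 cycles, so 2 is the smallest depth that avoids the stall. Because `clock_count` only reads `graph`, the design points are independent of one another and could be evaluated in parallel.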

Paper Structure

This paper contains 26 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: LightningSimV2 addresses three major limitations of LightningSim with three innovations. First, it uses static analysis to shrink the generated execution trace by not re-recording repeated design patterns, e.g., loops. Second, it greatly speeds up stall calculation by decoupling trace analysis into graph compilation and graph traversal. Third, it greatly speeds up DSE over hardware parameters, e.g., FIFO depths, by re-executing only the lightweight graph traversal step, which scales well and parallelizes with little memory overhead.
  • Figure 2: Illustration of the trace analysis stage in LightningSim, which comprises schedule resolution and stall calculation. Schedule resolution maps static stages to dynamic stages. Stall calculation uses an event-based simulation to compute the final clock count; this involves substantial redundant computation, since the same events may be checked repeatedly and the number of events can be large.
  • Figure 3: Illustration of the decoupled stall calculation in LightningSimV2, which consists of two steps. Step 1: dynamic graph construction, which is a one-time effort. Step 2: graph traversal for clock count calculation, which is extremely lightweight and can be repeated during DSE, e.g., with different FIFO depths.
  • Figure 4: The LightningSimV2 graph compiler architecture. Trace resolution produces a stream of events, each of which is timestamped with a dynamic stage in its corresponding module. These events are attached to pending nodes within the call stack and added to tracking structures. Eventually, when a pending node is committed, its events are used to update tracking structures that also determine what edges need to be created in the graph.
  • Figure 5: Parallel computing for DSE. Graph construction is a one-time effort: the compiled simulation graph $G$ is never modified afterwards, so it is read-only and does not need to be copied to different cores. During DSE, only the incremental stall calculation step is re-executed, in parallel across a batch of design points.
  • ...and 1 more figure