Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration

Mohamed Shahawy, Julien de Castelnau, Paolo Ienne

TL;DR

Bombyx tackles the challenge of mapping task-level parallel (TLP) software written in OpenCilk to FPGA hardware by translating OpenCilk programs into an explicit continuation-passing style (CPS) IR that is amenable to hardware mapping. Its backend has two stages: lowering to the explicit CPS-style IR, then generating synthesizable PEs for HardCilk via an HLS flow. A decoupled access-execute (DAE) optimization additionally overlaps memory access with compute. The key contributions are the explicit intermediate representation, path-based continuation extraction, automated HardCilk lowering with alignment and buffers, and the pragma-driven DAE optimization with demonstrated runtime gains. This work enables automated generation of FPGA accelerators from CPU-oriented TLP code and validates the approach on graph-traversal workloads.

Abstract

Task-level parallelism (TLP) is a widely used approach in software where independent tasks are dynamically created and scheduled at runtime. Recent systems have explored architectural support for TLP on field-programmable gate arrays (FPGAs), often leveraging high-level synthesis (HLS) to create processing elements (PEs). In this paper, we present Bombyx, a compiler toolchain that lowers OpenCilk programs into a Cilk-1-inspired intermediate representation, enabling efficient mapping of CPU-oriented TLP applications to spatial architectures on FPGAs. Unlike OpenCilk's implicit task model, which requires costly context switching in hardware, Cilk-1 adopts explicit continuation-passing - a model that better aligns with the streaming nature of FPGAs. Bombyx supports multiple compilation targets: one is an OpenCilk-compatible runtime for executing Cilk-1-style code using the OpenCilk backend, and another is a synthesizable PE generator designed for HLS tools like Vitis HLS. Additionally, we introduce a decoupled access-execute optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.
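
To make the decoupled access-execute idea concrete, the following is a minimal software sketch, not the code Bombyx generates. It assumes a toy workload (doubling and summing indexed loads); the names `coupled_sum`, `access_stage`, `execute_stage`, and `dae_sum` are hypothetical, and `std::queue` merely stands in for a hardware FIFO. Run sequentially on a CPU, the two stages do not actually overlap. The sketch only shows how splitting memory access from compute exposes the structure that hardware could run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

// Coupled version: every iteration loads a value and immediately computes
// on it, so memory latency sits on the critical path of the loop.
std::int64_t coupled_sum(const std::vector<std::int64_t>& mem,
                         const std::vector<std::size_t>& idx) {
    std::int64_t acc = 0;
    for (std::size_t i : idx) {
        acc += mem[i] * 2;
    }
    return acc;
}

// Access stage (hypothetical): only issues loads and forwards the data.
void access_stage(const std::vector<std::int64_t>& mem,
                  const std::vector<std::size_t>& idx,
                  std::queue<std::int64_t>& fifo) {
    for (std::size_t i : idx) {
        assert(i < mem.size());  // guard against out-of-range indices
        fifo.push(mem[i]);
    }
}

// Execute stage (hypothetical): only consumes data, never touches memory.
std::int64_t execute_stage(std::queue<std::int64_t>& fifo) {
    std::int64_t acc = 0;
    while (!fifo.empty()) {
        acc += fifo.front() * 2;
        fifo.pop();
    }
    return acc;
}

// Decoupled version: the FIFO is the only link between the two stages.
std::int64_t dae_sum(const std::vector<std::int64_t>& mem,
                     const std::vector<std::size_t>& idx) {
    std::queue<std::int64_t> fifo;  // stand-in for a hardware FIFO
    access_stage(mem, idx, fifo);
    return execute_stage(fifo);
}
```

Both versions compute the same result. The difference is that the decoupled form gives each stage a single responsibility, which is what allows an access unit to run ahead of the compute unit in a spatial implementation.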

Paper Structure

This paper contains 7 sections, 5 figures.

Figures (5)

  • Figure 1: OpenCilk code for a program computing Fibonacci. The use of spawn and sync represents implicit parallelism; that is, the programmer guarantees by the spawn keywords that the invoked function is independent, and the keyword sync implicitly synchronizes the spawned functions.
  • Figure 2: Cilk-1 code for computing Fibonacci. Compared to Figure 1, the sync keyword is replaced with send_argument and spawn_next, which represents explicit parallelism; that is, the user explicitly specifies the function to be executed after all the spawned functions complete, as opposed to resuming within the same function after a sync.
  • Figure 3: Compilation flow of Bombyx for an OpenCilk TLP program: (1) OpenCilk Clang extracts the abstract syntax tree (AST); (2) the AST is converted into an implicit intermediate representation (IR), where optimizations are applied; (3) the implicit IR is lowered to an explicit IR, which maps to hardware task-parallel frameworks.
  • Figure 4: Intermediate representations. Blue and red blocks are entry and exit blocks, respectively. In the Bombyx IRs, T: denotes the terminating statement of the basic block. (a) TAPIR is an LLVM derivative used by OpenCilk for compilation. This IR changes the structure of the original C++ code to apply optimizations for software, which makes it harder to generate readable explicit C++ code for HLS. (b) The implicit IR, an intermediate IR used by Bombyx because it is easier to convert to the explicit IR than the AST generated by the OpenCilk Clang frontend. This IR preserves the original structure of the C++ code. (c) The explicit IR, the final IR generated by Bombyx, which can be directly mapped to hardware primitives.
  • Figure 6: Synthesis results for DAE optimization PEs.
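
The figures themselves are not reproduced here, so the following is a hedged sketch of the explicit continuation-passing style that the Figure 2 caption describes, written as plain C++ with a tiny sequential scheduler. The names send_argument and spawn_next come from the caption. Everything else (`Closure`, `Cont`, `ready`, `run_fib`, and all signatures) is a hypothetical stand-in, not the Cilk-1, Bombyx, or HardCilk API.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical mini-runtime for illustration only.
struct Closure {
    int pending;                         // arguments still missing (join counter)
    std::int64_t args[2];                // slots filled by send_argument
    std::function<void(Closure&)> body;  // runs once pending reaches zero
};
using ClosurePtr = std::shared_ptr<Closure>;

// A continuation names one argument slot of a waiting closure.
struct Cont {
    ClosurePtr target;
    int slot;
};

static std::deque<std::function<void()>> ready;  // tasks that can run now

// Deliver a value; schedule the waiting closure when its last argument arrives.
void send_argument(const Cont& k, std::int64_t value) {
    assert(k.target->pending > 0);
    k.target->args[k.slot] = value;
    if (--k.target->pending == 0) {
        ClosurePtr c = k.target;
        ready.push_back([c] { c->body(*c); });
    }
}

// Create the successor task, which waits for `pending` arguments.
ClosurePtr spawn_next(int pending, std::function<void(Closure&)> body) {
    auto c = std::make_shared<Closure>();
    c->pending = pending;
    c->args[0] = c->args[1] = 0;
    c->body = std::move(body);
    return c;
}

// fib never blocks on a sync: it either answers its continuation directly
// or spawns two children plus a successor that receives their results.
void fib(Cont k, int n) {
    if (n < 2) {
        send_argument(k, n);
        return;
    }
    ClosurePtr sum = spawn_next(2, [k](Closure& self) {
        send_argument(k, self.args[0] + self.args[1]);
    });
    ready.push_back([sum, n] { fib(Cont{sum, 0}, n - 1); });  // spawn child
    ready.push_back([sum, n] { fib(Cont{sum, 1}, n - 2); });  // spawn child
}

// Driver: run ready tasks until the root closure has received the result.
std::int64_t run_fib(int n) {
    std::int64_t result = -1;
    ClosurePtr root = spawn_next(1, [&result](Closure& self) {
        result = self.args[0];
    });
    ready.push_back([root, n] { fib(Cont{root, 0}, n); });
    while (!ready.empty()) {
        auto task = std::move(ready.front());
        ready.pop_front();
        task();
    }
    return result;
}
```

Because no task ever suspends in the middle of its body, there is no stack or register context to save and restore. Each task runs to completion and passes its result on, which is the property the abstract identifies as a better fit for streaming hardware.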