Bombyx: OpenCilk Compilation for FPGA Hardware Acceleration
Mohamed Shahawy, Julien de Castelnau, Paolo Ienne
TL;DR
Bombyx maps task-level parallel (TLP) software written in OpenCilk onto FPGA hardware by translating OpenCilk programs into an explicit continuation-passing-style (CPS) intermediate representation (IR) that is amenable to hardware mapping. The flow has two stages: lowering to the explicit CPS-style IR, then generating synthesizable processing elements (PEs) for HardCilk through an HLS flow. A decoupled access-execute (DAE) optimization additionally overlaps memory access with computation. The key contributions are the explicit IR, path-based continuation extraction, automated HardCilk lowering with alignment and buffers, and the pragma-driven DAE optimization with demonstrated runtime gains. The work enables automated generation of FPGA accelerators from CPU-oriented TLP code and validates the approach on graph-traversal workloads.
Abstract
Task-level parallelism (TLP) is a widely used approach in software where independent tasks are dynamically created and scheduled at runtime. Recent systems have explored architectural support for TLP on field-programmable gate arrays (FPGAs), often leveraging high-level synthesis (HLS) to create processing elements (PEs). In this paper, we present Bombyx, a compiler toolchain that lowers OpenCilk programs into a Cilk-1-inspired intermediate representation, enabling efficient mapping of CPU-oriented TLP applications to spatial architectures on FPGAs. Unlike OpenCilk's implicit task model, which requires costly context switching in hardware, Cilk-1 adopts explicit continuation-passing, a model that better aligns with the streaming nature of FPGAs. Bombyx supports multiple compilation targets: one is an OpenCilk-compatible runtime for executing Cilk-1-style code using the OpenCilk backend, and another is a synthesizable PE generator designed for HLS tools like Vitis HLS. Additionally, we introduce a decoupled access-execute optimization that enables automatic generation of high-performance PEs, improving memory-compute overlap and overall throughput.
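To make the contrast between the implicit and explicit task models concrete, the following is a minimal sketch of explicit continuation-passing, written as a small Python model rather than as OpenCilk or as Bombyx's actual IR. The primitive names `spawn`, `spawn_next`, and `send_argument` follow Cilk-1 terminology; everything else (the `Closure` class, the `EMPTY` marker, the single `ready` queue, and `run_fib`) is a hypothetical illustration and not taken from the paper. The point it shows is that no task ever suspends: a task that would wait at a sync instead creates a successor closure and terminates, so no context needs to be saved.

```python
from collections import deque

EMPTY = object()  # marks an argument slot that a child task will fill later


class Closure:
    """A pending task: a function, its argument slots, and a join counter."""

    def __init__(self, fn, args):
        self.fn = fn
        self.args = list(args)
        self.missing = sum(a is EMPTY for a in self.args)


ready = deque()  # closures whose arguments are all present


def spawn(fn, *args):
    """Create a child task that is immediately ready to run."""
    ready.append(Closure(fn, args))


def spawn_next(fn, *args):
    """Create the successor task; it runs once its EMPTY slots are filled."""
    closure = Closure(fn, args)
    if closure.missing == 0:
        ready.append(closure)
    return closure


def send_argument(cont, value):
    """Fill one slot of a waiting closure; schedule it when the last one arrives."""
    closure, slot = cont
    closure.args[slot] = value
    closure.missing -= 1
    if closure.missing == 0:
        ready.append(closure)


def fib(cont, n):
    # Runs to completion without ever blocking on its children.
    if n < 2:
        send_argument(cont, n)
    else:
        successor = spawn_next(fib_sum, cont, EMPTY, EMPTY)
        spawn(fib, (successor, 1), n - 1)
        spawn(fib, (successor, 2), n - 2)


def fib_sum(cont, x, y):
    # The code that would follow the sync in the implicit model.
    send_argument(cont, x + y)


def store_result(out, value):
    out.append(value)


def run_fib(n):
    out = []
    root = spawn_next(store_result, out, EMPTY)
    spawn(fib, (root, 1), n)
    while ready:  # a trivial sequential scheduler
        task = ready.popleft()
        task.fn(*task.args)
    return out[0]


print(run_fib(10))  # 55
```

In this model every task is a short, run-to-completion function fed by a queue of closures, which is the property the abstract describes as aligning with the streaming nature of FPGAs. A real system would replace the single sequential queue with hardware scheduling across many PEs.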
