Optimizing Offload Performance in Heterogeneous MPSoCs
Luca Colagrande, Luca Benini
TL;DR
This work tackles the overheads that limit speedup when offloading work to accelerator clusters in heterogeneous MPSoCs. It proposes a hardware-software co-design, including a multicast host-to-accelerator interconnect and a centralized synchronization unit, together with an analytical runtime model that predicts offload time. Evaluation on a Manticore-derived platform shows that reducing offload overheads improves speedup by up to 47.9% for a 1024-d DAXPY, and that the runtime model is accurate, with mean absolute percentage errors (MAPEs) below 1% across configurations. The model makes it possible to calculate the minimum number of clusters required to satisfy a timing constraint. Overall, the results demonstrate the practical viability of fine-grained offloads in heterogeneous MPSoCs and provide a concrete methodology for making offload decisions under runtime limits.
Abstract
Heterogeneous multi-core architectures combine, on a single chip, a few "host" cores optimized for single-thread performance with many small, energy-efficient "accelerator" cores for data-parallel processing. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost that reduces the speedup attainable on the accelerator, particularly for small, fine-grained parallel tasks. We demonstrate that by co-designing the hardware and offload routines, we can increase the speedup of an offloaded DAXPY kernel by as much as 47.9%. Furthermore, we show that it is possible to accurately model the runtime of an offloaded application, accounting for the offload overheads, with a mean absolute percentage error (MAPE) as low as 1%, enabling optimal offload decisions under offload execution time constraints.
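
To make the idea of a model-driven offload decision concrete, the following minimal sketch defines the DAXPY operation (y = a*x + y) and a toy runtime model. The model is an illustrative assumption, not the one developed in this work: it treats the offload overhead as a fixed cost and the computation as perfectly parallel across clusters. The names `offload_runtime` and `min_clusters`, their parameters, and the cycle counts in the usage note are all hypothetical.

```python
import math


def daxpy(a, x, y):
    """Return a*x + y element-wise (the DAXPY operation) as a new list."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [a * xi + yi for xi, yi in zip(x, y)]


def offload_runtime(n_elems, n_clusters, t_overhead, t_per_elem):
    """Toy runtime estimate in cycles (assumed model, not the paper's).

    A fixed offload overhead is added to the compute time of the most
    heavily loaded cluster, assuming elements are split evenly.
    """
    elems_per_cluster = math.ceil(n_elems / n_clusters)
    return t_overhead + elems_per_cluster * t_per_elem


def min_clusters(n_elems, deadline, t_overhead, t_per_elem, max_clusters):
    """Smallest cluster count whose estimated runtime meets the deadline.

    Returns None if no count up to max_clusters satisfies it, for example
    when the fixed overhead alone already exceeds the deadline.
    """
    for n in range(1, max_clusters + 1):
        if offload_runtime(n_elems, n, t_overhead, t_per_elem) <= deadline:
            return n
    return None
```

With hypothetical values of 100 cycles of overhead and 1 cycle per element, `min_clusters(1024, 400, 100, 1, 32)` returns 4, since three clusters would take an estimated 442 cycles and four take 356. If the deadline is shorter than the overhead itself, no cluster count helps and the function returns `None`, which is why reducing the overhead matters most for small, fine-grained tasks.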
