Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies
Hoa Nguyen, Pongstorn Maidee, Jason Lowe-Power, Alireza Kaviani
TL;DR
Choreographer is a full-system, gem5-based framework for evaluating latency-sensitive, fine-grained accelerators. It tightly integrates a detailed cache hierarchy, a Linux software stack, and a near-LLC accelerator with ISA-agnostic MMIO offloading, and it provides a driver and a domain-specific library to streamline application development. Two case studies demonstrate the framework: a data-aware prefetcher that achieves up to 1.88x speedup and a quicksort accelerator that achieves over 2x speedup. The work highlights the critical role of accurate cache modeling and address translation in realistic evaluations and offers a practical tool for optimizing accelerator designs in cache-coherent systems. Overall, Choreographer enables rapid prototyping and thorough system-level analysis of fine-grained offloads, informing design choices for latency-sensitive computing.
Abstract
In this paper, we introduce Choreographer, a simulation framework that enables holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions by integrating a detailed gem5-based hardware stack, featuring an AMBA Coherent Hub Interface (CHI) mesh network, with a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Our detailed cache model provides accurate insights into performance variations caused by cache configurations, which other frameworks do not capture. We demonstrate the framework through two case studies: a data-aware prefetcher for graph analytics workloads and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and to guide performance optimization in small-task offloading scenarios.
