Choreographer: A Full-System Framework for Fine-Grained Tasks in Cache Hierarchies

Hoa Nguyen, Pongstorn Maidee, Jason Lowe-Power, Alireza Kaviani

TL;DR

Choreographer tackles the challenge of evaluating latency-sensitive, fine-grained accelerators by delivering a full-system, gem5-based framework that tightly integrates a detailed cache hierarchy, a Linux software stack, and a near-LLC accelerator with ISA-agnostic MMIO offloading. It provides a driver and a domain-specific library to streamline application development. The framework is demonstrated through a data-aware prefetcher and a quicksort accelerator, which achieve speedups of up to 1.88x and over 2x, respectively. The work highlights the critical role of accurate cache modeling and address translation in realistic evaluations and offers a practical tool for optimizing accelerator designs in cache-coherent systems. Overall, Choreographer enables rapid prototyping and thorough system-level analysis of fine-grained offloads, informing design choices for latency-sensitive computing.

Abstract

In this paper, we introduce Choreographer, a simulation framework that enables a holistic system-level evaluation of fine-grained accelerators designed for latency-sensitive tasks. Unlike existing frameworks, Choreographer captures all hardware and software overheads in core-accelerator and cache-accelerator interactions, integrating a detailed gem5-based hardware stack featuring an AMBA coherent hub interface (CHI) mesh network and a complete Linux-based software stack. To facilitate rapid prototyping, it offers a C++ application programming interface and modular configuration options. Our detailed cache model provides accurate insights into performance variations caused by cache configurations, which are not captured by other frameworks. The framework is demonstrated through two case studies: a data-aware prefetcher for graph analytics workloads, and a quicksort accelerator. Our evaluation shows that the prefetcher achieves speedups between 1.08x and 1.88x by reducing memory access latency, while the quicksort accelerator delivers more than 2x speedup with minimal address translation overhead. These findings underscore the ability of Choreographer to model complex hardware-software interactions and optimize performance in small task offloading scenarios.

Paper Structure

This paper contains 22 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: High-level system overview of Choreographer.
  • Figure 2: Address translation in an accelerator can affect overall system performance in unexpected ways. Because functional translation provides translations instantly, one would expect timed translation to yield lower speedups. However, functional translation underestimates prefetcher performance by up to 4.3%.
  • Figure 3: Task offloading flow in our framework. Before the first task is offloaded, the application asks the driver to allocate an uncacheable page, whose physical address range is sent to the accelerator through its configuration channel (e.g., MSRs for the x86-64 ISA). The core/application sends a task to the accelerator via an uncacheable store request to this uncacheable memory region, which is known to the accelerator. The task status can be queried with an uncacheable load request to the same memory region. Because gem5's implementation of the CHI coherence protocol does not support uncacheable memory requests, we introduce a forwarder object in the simulator that forwards uncacheable requests and responses for a specific memory region between the core/application and the accelerator.
  • Figure 4: Illustration of our detailed cache model included in our framework.
  • Figure 5: Correlation of cache hit latencies to a real system.
  • ...and 8 more figures
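
The offloading flow described in the Figure 3 caption can be illustrated with a small software-only sketch. This is not Choreographer's driver or library API, which this page does not show. `Task`, `Status`, `MockAccelerator`, and `Forwarder` are hypothetical names, the task layout is invented for illustration, and the mock runs the task immediately inside the store, whereas a real accelerator would work asynchronously. Only the sequence follows the caption: register the uncacheable region, submit a task with a store, and query its status with a load.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical task descriptor; the real layout is not given in the caption.
struct Task {
    std::size_t first;  // index of the first element to sort
    std::size_t count;  // number of elements to sort
};

enum class Status { Idle, Done };

// Software stand-in for the near-LLC accelerator (assumed behavior).
class MockAccelerator {
public:
    // Configuration channel: record the physical range of the uncacheable page.
    void configure(std::uint64_t base, std::uint64_t size) {
        base_ = base;
        size_ = size;
    }

    // True if the address lies inside the registered uncacheable region.
    bool owns(std::uint64_t paddr) const {
        return size_ != 0 && paddr >= base_ && paddr - base_ < size_;
    }

    // An uncacheable store delivers a task; this mock runs it immediately.
    void on_store(const Task& task, std::vector<int>& memory) {
        assert(task.first <= memory.size() &&
               task.count <= memory.size() - task.first);
        auto begin = memory.begin() + static_cast<std::ptrdiff_t>(task.first);
        std::sort(begin, begin + static_cast<std::ptrdiff_t>(task.count));
        status_ = Status::Done;
    }

    // An uncacheable load returns the current task status.
    Status on_load() const { return status_; }

private:
    std::uint64_t base_ = 0;
    std::uint64_t size_ = 0;
    Status status_ = Status::Idle;
};

// Stand-in for the forwarder object: routes accesses that fall inside the
// registered region to the accelerator and rejects everything else.
class Forwarder {
public:
    explicit Forwarder(MockAccelerator& accel) : accel_(accel) {}

    bool store(std::uint64_t paddr, const Task& task, std::vector<int>& memory) {
        if (!accel_.owns(paddr)) return false;
        accel_.on_store(task, memory);
        return true;
    }

    bool load(std::uint64_t paddr, Status& out) const {
        if (!accel_.owns(paddr)) return false;
        out = accel_.on_load();
        return true;
    }

private:
    MockAccelerator& accel_;
};
```

In this sketch, a caller would first call `configure` with the page's address range, then issue `Forwarder::store` with a `Task`, and finally poll `Forwarder::load` until it reports `Status::Done`. Accesses outside the registered range are rejected, mirroring the caption's point that only one specific memory region is forwarded.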