LRSCwait: Enabling Scalable and Efficient Synchronization in Manycore Systems through Polling-Free and Retry-Free Operation

Samuel Riedel, Marc Gantenbein, Alessandro Ottaviano, Torsten Hoefler, Luca Benini

TL;DR

Polling in shared-memory manycore synchronization degrades throughput and energy efficiency. The authors introduce LRwait, SCwait, and Mwait, which let contending cores sleep while they wait instead of polling, and Colibri, a scalable reservation-queue implementation for manycore systems. They provide correctness guarantees (mutual exclusion, deadlock freedom, and starvation freedom) and explore ideal and optimized hardware designs, with Mwait as an extension. Empirical results on a 256-core MemPool platform show Colibri achieving up to 6.5x throughput and up to 8.8x energy efficiency improvements over LRSC-based approaches, with only ~6% area overhead.

Abstract

Extensive polling in shared-memory manycore systems can lead to contention, decreased throughput, and poor energy efficiency. Both lock implementations and the general-purpose atomic operation, load-reserved/store-conditional (LRSC), cause polling due to serialization and retries. To alleviate this overhead, we propose LRwait and SCwait, a synchronization pair that eliminates polling by allowing contending cores to sleep while waiting for previous cores to finish their atomic access. As a scalable implementation of LRwait, we present Colibri, a distributed and scalable approach to managing LRwait reservations. Through extensive benchmarking on an open-source RISC-V platform with 256 cores, we demonstrate that Colibri outperforms current synchronization approaches for various concurrent algorithms with high and low contention regarding throughput, fairness, and energy efficiency. With an area overhead of only 6%, Colibri outperforms LRSC-based implementations by a factor of 6.5x in terms of throughput and 7.1x in terms of energy efficiency.
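To make the retry overhead described above concrete, the following is a minimal toy model, not the authors' hardware design or benchmark. It assumes that every core performs one atomic increment on the same address, that LRSC cores proceed in lockstep rounds in which exactly one SC succeeds, and that queued LRwait reservations are served in FIFO order with an SCwait that always succeeds. The function names `lrsc_requests` and `lrwait_requests` are hypothetical.

```python
from collections import deque


def lrsc_requests(num_cores):
    """Count memory requests for one LR/SC increment per core (toy model).

    Assumption: all pending cores retry in lockstep rounds and exactly
    one store-conditional succeeds per round.
    """
    requests = 0
    pending = num_cores
    while pending > 0:
        requests += pending  # every pending core issues a load-reserved
        requests += pending  # every pending core issues a store-conditional
        pending -= 1         # one SC succeeds, the others must retry
    return requests


def lrwait_requests(num_cores):
    """Count memory requests when reservations are queued (toy model).

    Assumption: each core issues LRwait once, sleeps until it reaches the
    head of a FIFO queue, and its SCwait succeeds without a retry.
    Returns (requests, final counter value).
    """
    counter = 0
    queue = deque(range(num_cores))
    requests = num_cores     # one LRwait per core, no polling afterwards
    while queue:
        queue.popleft()      # head of the queue receives the LRwait response
        counter += 1         # modify the value, then write it back via SCwait
        requests += 1        # one SCwait per core
    return requests, counter


if __name__ == "__main__":
    for n in (4, 64, 256):
        print(n, lrsc_requests(n), lrwait_requests(n)[0])
```

Under these assumptions the LRSC request count grows quadratically with the number of contending cores, while the queued variant grows linearly. Real systems add backoff, network effects, and reservation invalidation that this sketch ignores.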

Paper Structure (23 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Difference between an architecture with a reservation table, LRSCwait with a reservation queue, and Colibri with a linked-list-like structure.
  • Figure 2: LRwait and SCwait sequence in Colibri with two cores and one queue.
  • Figure 3: Throughput of different LRSCwait implementations and standard RISC-V atomics at varying contention.
  • Figure 4: Throughput of different lock implementations compared to generic atomics at varying contention.
  • Figure 5: Matrix multiplication performance with interference from atomics. The poller-to-worker ratio is annotated in the figure with poller:worker.
  • ...and 1 more figure