COREC: Concurrent Non-Blocking Single-Queue Receive Driver for Low Latency Networking

Marco Faltelli, Giacomo Belocchi, Francesco Quaglia, Giuseppe Bianchi

TL;DR

COREC addresses tail latency in data-center networks by enabling multiple threads to concurrently process a single shared Rx queue without locks. It achieves this via a non-blocking coordination mechanism built on atomic Read-Modify-Write (RMW) primitives such as Compare-And-Swap (CAS), partitioning descriptor batches among threads while remaining transparent to the NIC. Implemented in DPDK v21.11 for the ixgbe, i40e, and ice drivers, COREC shows that the extra packet reordering it may introduce is non-critical, and it delivers notable latency reductions, especially for UDP traffic and short TCP flows, with a worst-case performance penalty of around 2–3%. The approach offers a scalable, work-conserving alternative to conventional one-thread-per-queue processing and could be extended to kernel-space frameworks (Linux, XDP, RDMA), which makes it relevant to network driver design for low-latency, high-core-count environments.

Abstract

Existing network stacks tackle performance and scalability by relying on multiple receive queues. However, at the software level, each queue is processed by a single thread, which prevents simultaneous work on the same queue and limits performance in terms of tail latency. To overcome this limitation, we introduce COREC, the first software implementation of a concurrent non-blocking single-queue receive driver. Sharing a single queue among multiple threads improves workload distribution and leads to a work-conserving policy for network stacks. On the technical side, instead of relying on traditional critical sections, which would serialize the threads' operations, COREC coordinates the threads that concurrently access the same receive queue in a non-blocking manner via atomic machine instructions from the Read-Modify-Write (RMW) class. These instructions allow threads to access and update memory locations atomically, based on specific conditions, such as a match against a target value selected by the thread. They also make any update globally visible in the memory hierarchy, bypassing the interference with memory consistency caused by CPU store buffers. Extensive evaluation results demonstrate that the additional reordering our approach may occasionally cause is non-critical and has minimal impact on performance, even in the worst-case scenario of a single large TCP flow, where the performance impairment amounts to at most 2-3 percent. Conversely, substantial latency gains are achieved when handling UDP traffic, a real-world traffic mix, and multiple shorter TCP flows.
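To make the conditional-update idea concrete, here is a minimal C11 sketch of how threads might claim disjoint descriptor batches from one shared queue with a compare-and-swap instead of a lock. It is an illustration under stated assumptions, not COREC's actual driver code: the struct `rxq_sketch`, its fields `claim_head` and `ready_tail`, and the function `rxq_claim_batch` are hypothetical names, and a software counter stands in for the per-descriptor completion status that a real NIC writes back.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared Rx queue state (illustrative names, not from DPDK). */
struct rxq_sketch {
    _Atomic uint32_t claim_head; /* first descriptor not yet claimed by any thread */
    _Atomic uint32_t ready_tail; /* one past the last descriptor that is ready */
};

/*
 * Try to claim up to max_batch ready descriptors without blocking.
 * Returns the number claimed (0 if none are ready) and stores the index
 * of the first claimed descriptor in *first. Both counters are assumed
 * to be free-running, so unsigned subtraction stays correct across wrap.
 */
static uint32_t rxq_claim_batch(struct rxq_sketch *q, uint32_t max_batch,
                                uint32_t *first)
{
    assert(max_batch > 0);
    uint32_t head = atomic_load_explicit(&q->claim_head, memory_order_relaxed);

    for (;;) {
        uint32_t tail = atomic_load_explicit(&q->ready_tail,
                                             memory_order_acquire);
        uint32_t avail = tail - head;
        if (avail == 0)
            return 0; /* nothing to do; the caller may poll again later */

        uint32_t n = avail < max_batch ? avail : max_batch;

        /* The update succeeds only if claim_head still equals the value
         * this thread observed; otherwise another thread claimed first. */
        if (atomic_compare_exchange_strong_explicit(
                &q->claim_head, &head, head + n,
                memory_order_acq_rel, memory_order_relaxed)) {
            *first = head;
            return n;
        }
        /* On failure, head now holds the current claim_head; retry. */
    }
}
```

A failed CAS here means some other thread advanced `claim_head`, so the system as a whole always makes progress, which is the property that distinguishes this style from a critical section. Each successful caller owns the range `[*first, *first + n)` exclusively and can process those packets in parallel with other threads, which is also where occasional reordering across batches can arise.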

Paper Structure (22 sections, 22 figures, 5 tables)

Figures (22)

  • Figure 1: A typical ring buffer
  • Figure 2: Scale-out (NxM/M/1) vs. scale-up (M/M/N) policy
  • Figure 3: Mean latency - 4 cores
  • Figure 4: 99p latency - 4 cores
  • Figure 5: Mean latency - 8 cores
  • ...and 17 more figures