SDT: Cutting Datacenter Tax Through Simultaneous Data-Delivery Threads

Amin Mamandipoor, Huy Dinh Tran, Mohammad Alian

TL;DR

The paper tackles the datacenter networking tax by proposing Simultaneous Data-delivery Threads (SDT), a per-core enhancement that co-locates data delivery with application processing and uses asymmetric resource partitioning to minimize interference. It distinguishes data delivery from data processing and uses a software daemon with a custom STRP instruction to dynamically allocate microarchitectural resources, enabling near-run-to-completion data delivery while preserving processing throughput. In full-system gem5 simulations with a DPDK-based benchmark, a 20-core CMP enhanced with SDT reduces area by 47.5% and power by 66% relative to a baseline 40-core CMP, while incurring less than a 10% network throughput penalty and maintaining at least 90% of beefy-core performance. The work offers a practical path toward energy-efficient, scalable datacenter CMPs by rethinking NIC-to-CPU data movement with specialized, dynamically partitioned SDT resources.

Abstract

Networking is considered a datacenter tax, and hyperscalers push hard to provide high-performance networking with minimal resource expenditure. To keep up with the ever-increasing network rates, many CPU cycles are spent on the networking tax. We make a key observation that network processing threads can be simultaneously executed on server CPUs with minimal interference with the application threads. However, utilizing simultaneous multithreading (SMT) to scale the number of network threads with the number of application threads suffers from (1) failing to provide strict tail latency requirements for latency-critical applications, and (2) reducing the number of available hardware threads for application processes, thus contributing to a high datacenter network tax. In this work, we design, implement, and evaluate a chip-multiprocessor (CMP) with specialized Simultaneous Data-delivery Threads (SDT) per physical core. The key insight is that with judicious partitioning at the architectural level, SDT can safely co-run with application processes with guaranteed performance isolation. Our evaluation results, using full-system simulation, show that a 20-core CMP enhanced with SDT reduces the area and power consumption of a baseline 40-core CMP by 47.5% and 66%, respectively, while reducing network throughput by less than 10%.
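To put the headline numbers in perspective, the short sketch below derives what they imply for network throughput per unit of area and per watt. This is back-of-the-envelope arithmetic on the figures quoted in the abstract, not a result reported by the authors. It assumes the throughput penalty is measured against the same 40-core baseline and treats the "less than 10%" penalty as its worst case, so the outputs are lower bounds. The function and constant names are illustrative.

```python
def efficiency_gain(throughput_ratio: float, resource_ratio: float) -> float:
    """Throughput per unit of resource, relative to the baseline (1.0 = parity)."""
    return throughput_ratio / resource_ratio


# Ratios of the 20-core SDT CMP to the 40-core baseline, taken from the abstract.
AREA_RATIO = 1 - 0.475        # area reduced by 47.5%
POWER_RATIO = 1 - 0.66        # power reduced by 66%
MIN_THROUGHPUT_RATIO = 0.90   # assumption: worst case of the "< 10%" penalty

per_area = efficiency_gain(MIN_THROUGHPUT_RATIO, AREA_RATIO)
per_watt = efficiency_gain(MIN_THROUGHPUT_RATIO, POWER_RATIO)

print(f"throughput per unit area: at least {per_area:.2f}x baseline")
print(f"throughput per watt:      at least {per_watt:.2f}x baseline")
```

Under these assumptions the SDT design delivers roughly 1.7x the baseline's network throughput per unit area and roughly 2.6x per watt.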

Paper Structure

This paper contains 6 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Scaling data network delivery bandwidth requires CPU cycles.
  • Figure 2: Overview of (a) run-to-completion and (b) pipeline applications datapath, (c) iperf's performance for different NUMA settings.
  • Figure 3: Sensitivity of data delivery thread to size of microarchitectural structures. The horizontal line is the 90% performance watermark.
  • Figure 4: SDT requirements for different levels of compute-intensity applications.