A Quantitative Analysis and Guidelines of Data Streaming Accelerator in Modern Intel Xeon Scalable Processors

Reese Kuper, Ipoom Jeong, Yifan Yuan, Jiayu Hu, Ren Wang, Narayan Ranganathan, Nam Sung Kim

TL;DR

The paper addresses the CPU cycles that datacenters waste on memory movement by evaluating Intel's Data Streaming Accelerator (DSA) on Sapphire Rapids. It combines microbenchmark analyses, a survey of the DSA software ecosystem, and a DPDK Vhost case study to quantify throughput, latency, and cache interactions, and to derive practical guidelines. Key findings show that DSA can deliver substantial throughput gains, reduce wasted CPU cycles through asynchronous and batched offloads, and mitigate cache pollution while supporting virtualization and memory tiering. The work demonstrates practical impact through real workloads and software stacks, underscoring DSA's potential to accelerate data movement across DRAM, CXL memory, and beyond in modern datacenters.

Abstract

Because semiconductor power density no longer stays constant as process technology scales down, modern CPUs are integrating capable data accelerators on chip to improve performance and efficiency for a wide range of applications and usages. One such accelerator is the Intel Data Streaming Accelerator (DSA), introduced in Intel 4th Generation Xeon Scalable CPUs (Sapphire Rapids). DSA targets in-memory data movement operations, which are common sources of overhead in datacenter workloads and infrastructure. It is also made considerably more versatile by supporting a wider range of operations on streaming data, such as CRC32 calculations, delta record creation/merging, and data integrity field (DIF) operations. This paper sets out to introduce the latest features supported by DSA, deep-dive into its versatility, and analyze its throughput benefits through a comprehensive evaluation. Along with an analysis of DSA's characteristics and its rich software ecosystem, we summarize several insights and guidelines that help programmers make the most of DSA, and we use an in-depth case study of DPDK Vhost to demonstrate how these guidelines benefit a real application.
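
Of the operations the abstract lists, delta record creation/merging is probably the least familiar. The following is a minimal software sketch of the idea in Python, not DSA's interface: a delta record stores only the chunks in which two equal-sized buffers differ, and merging applies those chunks back onto the original. The 8-byte comparison granularity, the entry layout (2-byte chunk index followed by the chunk data), and the function names `create_delta` and `apply_delta` are assumptions made for illustration and are not taken from the paper.

```python
import struct

CHUNK = 8            # comparison granularity in bytes (assumed for this sketch)
MAX_CHUNKS = 0x10000  # a 2-byte index can address at most 65536 chunks


def create_delta(original: bytes, modified: bytes) -> bytes:
    """Build a delta record: one (chunk index, new data) entry per differing chunk."""
    if len(original) != len(modified) or len(original) % CHUNK:
        raise ValueError("buffers must have equal length, a multiple of CHUNK")
    if len(original) > MAX_CHUNKS * CHUNK:
        raise ValueError("buffer too large for a 2-byte chunk index")
    entries = []
    for off in range(0, len(original), CHUNK):
        new = modified[off:off + CHUNK]
        if original[off:off + CHUNK] != new:
            # Entry layout (hypothetical): little-endian 2-byte index, then the data.
            entries.append(struct.pack("<H", off // CHUNK) + new)
    return b"".join(entries)


def apply_delta(original: bytes, delta: bytes) -> bytes:
    """Merge a delta record into a copy of the original buffer."""
    entry_size = 2 + CHUNK
    if len(delta) % entry_size:
        raise ValueError("delta record length is not a whole number of entries")
    out = bytearray(original)
    for pos in range(0, len(delta), entry_size):
        (idx,) = struct.unpack_from("<H", delta, pos)
        out[idx * CHUNK:(idx + 1) * CHUNK] = delta[pos + 2:pos + entry_size]
    return bytes(out)


if __name__ == "__main__":
    old = bytes(64)
    new = bytearray(old)
    new[8:16] = b"ABCDEFGH"   # changes chunk 1
    new[40] = 0xFF            # changes chunk 5
    delta = create_delta(old, bytes(new))
    print(len(delta), apply_delta(old, delta) == bytes(new))
```

With two modified chunks out of eight, the record holds two entries instead of a full copy of the buffer, which is why this kind of operation suits sparsely updated data.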

Paper Structure (24 sections, 21 figures, 2 tables)

Figures (21)

  • Figure 1: Architectural overview of DSA. Job descriptors are directly submitted to memory-mapped portals in each device. IOMMU allows in-device address translation, and thus memory pinning is not required.
  • Figure 2: Throughput improvements of data streaming operations over their software counterparts with varying transfer sizes (batch size: 1). Memory Fill and NT-Memory Fill refer to allocating and non-allocating writes (similar to regular store and nt-store), respectively.
  • Figure 3: Throughput of DSA's MemoryCopy operation on Sync or Async offloading with varying transfer sizes and batch sizes (BS)
  • Figure 4: Throughput of DSA's asynchronous MemoryCopy operation with different WQ sizes (WQS)
  • Figure 5: Breakdown of memcpy() latency on CPU (left bars) and MemoryCopy operation latency on DSA (right stacked bars) with varying batch sizes (transfer size: 4 KB)
  • ...and 16 more figures