The DMA Streaming Framework: Kernel-Level Buffer Orchestration for High-Performance AI Data Paths

Marco Graziano

Abstract

AI transport libraries move bytes efficiently, but they commonly assume that buffers are already correctly allocated, placed, shared, registered, and safe under completion and teardown pressure. This paper presents dmaplane, a Linux kernel module that makes this missing layer explicit as buffer orchestration. dmaplane exposes a stable kernel UAPI via /dev/dmaplane and composes ring-based command channels, DMA buffer lifecycle management, dma-buf export for cross-device sharing, a kernel-space RDMA engine, NUMA-aware allocation and verification, credit-based flow control, low-overhead observability, and GPU memory integration via PCIe BAR pinning. We evaluate orchestration sensitivity with measurements of NUMA cross-node penalties at DRAM scale, completion-safe flow control under sustained RDMA load, and GPU BAR mapping tiers versus cudaMemcpy. We also demonstrate end-to-end disaggregated inference by transferring KV-cache chunks between two machines using RDMA WRITE WITH IMMEDIATE and reconstructing tensor views on the receiver. RDMA measurements use Soft-RoCE; we distinguish measured results from properties that are provider-independent by construction.

Paper Structure (45 sections, 4 figures, 5 tables)

This paper contains 45 sections, 4 figures, 5 tables.

Introduction
Goals and non-goals
Motivating workloads beyond disaggregated inference
Contributions
Paper organization
Background and Related Work
Kernel primitives used by dmaplane
Positioning and closest neighbors
Architecture and Design Invariants
Architecture overview
Locking model
Design invariants
Core Mechanisms
Channels and ring-based dispatch
Buffer lifecycle, mmap, and dma-buf export
...and 30 more sections

Figures (4)

Figure 1: dmaplane block diagram.
Figure 2: dma-buf export and per-importer attachment mapping.
Figure 3: RDMA resource hierarchy.
Figure 4: Timeline of send CQ credits and receive window credits in a WRITE WITH IMMEDIATE pipeline.
