HeteroPod: XPU-Accelerated Infrastructure Offloading for Commodity Cloud-Native Applications
Bicheng Yang, Jingkai He, Dong Du, Yubin Xia, Haibo Chen
TL;DR
HeteroPod introduces a dynamic cross-PU offload approach that moves cloud-native infrastructure containers from the host CPU to DPUs, reducing the resource burden of infrastructure services while preserving Pod semantics. It provides HeteroNet, a cross-PU networking substrate built on a split network namespace and a kernel-co-designed, kernel-bypass-friendly user-space stack, enabling high-performance communication between CPUs and DPUs. With HeteroK8s, the authors evaluate service mesh, serverless, and scheduling workloads on real DPUs and CXL-based setups, reporting up to 31.9x better latency and 64x less resource consumption than a kernel-bypass design, as well as 60% better end-to-end latency and 55% higher scalability than state-of-the-art systems. The work offers a practical path toward denser, better-isolated, and more cost-efficient cloud-native deployments, with open-source tooling for broader adoption.
Abstract
Cloud-native systems increasingly rely on infrastructure services (e.g., service meshes, monitoring agents), which compete for resources with user applications, degrading performance and scalability. We propose HeteroPod, a new abstraction that offloads these services to Data Processing Units (DPUs) to enforce strict isolation while reducing host resource contention and operational costs. To realize HeteroPod, we introduce HeteroNet, a cross-PU (XPU) network system featuring: (1) split network namespace, a unified network abstraction for processes spanning CPU and DPU, and (2) elastic and efficient XPU networking, a communication mechanism achieving shared-memory performance without pinned resource overhead and polling costs. By leveraging HeteroNet and the compositional nature of cloud-native workloads, HeteroPod can optimally offload infrastructure containers to DPUs. We implement HeteroNet based on Linux, and implement a cloud-native system called HeteroK8s based on Kubernetes. We evaluate the systems using NVIDIA BlueField-2 DPUs and CXL-based DPUs (simulated with real CXL memory devices). The results show that HeteroK8s effectively supports complex (unmodified) commodity cloud-native applications (up to 1 million LoC) and provides up to 31.9x better latency and 64x less resource consumption (compared with a kernel-bypass design), 60% better end-to-end latency, and 55% higher scalability compared with SOTA systems.
