GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines
Mike Wong, Ulysses Butler, Emma Farkash, Praveen Tammana, Anirudh Sivaraman, Ravi Netravali
TL;DR
This paper tackles inefficiencies in complex AI inference pipelines by proposing to offload on-path data-processing tasks to SmartNICs, which mitigates CPU/GPU contention and reduces latency. It develops a taxonomy of offloadable tasks, arguing that the data transformations and formatting operations surrounding inference are best suited to network-based acceleration, and it analyzes the core challenges: finite packet context, compute overheads, memory limits, and the need for parallelism. The authors present three concrete offload examples (image normalization, bilinear interpolation, and tokenization), with design strategies that include per-channel lookup tables, tile- and row-based serialization, and overlap-based tokenization, along with practical memory and throughput considerations. They also propose a roadmap for automatic compilation that would map pipeline specifications to NIC implementations and coordinate with distributed schedulers, opening a path to integrating network hardware into AI serving runtimes for improved latency and resource utilization.
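To make the per-channel lookup table idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes 8-bit pixels and the common normalization (v / 255 - mean) / std; the function names and the mean/std values are hypothetical. Because each channel has only 256 possible inputs, the arithmetic can be precomputed once, leaving a single table lookup per byte on the data path.

```python
def build_channel_luts(means, stds):
    """Precompute one 256-entry table per channel (assumed 8-bit pixels).

    Each entry holds (v / 255 - mean) / std for pixel value v, so the
    per-pixel work at serving time reduces to an index operation.
    """
    return [
        [(v / 255.0 - mean) / std for v in range(256)]
        for mean, std in zip(means, stds)
    ]


def normalize_pixels(pixels, luts):
    """Normalize a list of per-pixel channel tuples using only lookups."""
    return [
        tuple(luts[channel][value] for channel, value in enumerate(pixel))
        for pixel in pixels
    ]


# Hypothetical statistics chosen for illustration only.
luts = build_channel_luts(means=[0.5, 0.5, 0.5], stds=[0.5, 0.5, 0.5])
normalized = normalize_pixels([(0, 255, 0), (255, 0, 255)], luts)
```

The tables here hold floats for readability. On hardware without floating-point support, the entries would presumably be stored in a fixed-point encoding instead, which is a further assumption beyond what the summary states.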
Abstract
The increasing prominence of AI necessitates the deployment of inference platforms for efficient and effective management of AI pipelines and compute resources. As these pipelines grow in complexity, the demand for distributed serving rises and introduces much-dreaded network delays. In this paper, we investigate how the network can instead help reduce the excessively high resource overheads of AI pipelines. To alleviate these overheads, we discuss how resource-intensive data processing tasks, a key facet of growing AI pipeline complexity, are well matched to the computational characteristics of packet processing pipelines and how they can be offloaded onto SmartNICs. We explore the challenges and opportunities of offloading, and propose a research agenda for integrating network hardware into AI pipelines, unlocking new opportunities for optimization.
