GPUs, CPUs, and... NICs: Rethinking the Network's Role in Serving Complex AI Pipelines

Mike Wong, Ulysses Butler, Emma Farkash, Praveen Tammana, Anirudh Sivaraman, Ravi Netravali

TL;DR

This paper tackles inefficiencies in complex AI inference pipelines by proposing to offload on-path data-processing tasks to SmartNICs, thereby mitigating CPU/GPU contention and reducing latency. It develops a taxonomy of offloadable tasks, arguing that the data transformations and formatting operations surrounding inference are best suited to network-based acceleration, and it analyzes core challenges such as finite packet context, compute overheads, memory limits, and the need for parallelism. The authors present three concrete offload examples (image normalization, bilinear interpolation, and tokenization) with design strategies including per-channel lookup tables, tile- and row-based serialization, and overlap-based tokenization, along with practical memory and throughput considerations. They also propose a roadmap for automatic compilation that would map pipeline specifications to NIC implementations and coordinate with distributed schedulers, opening a path to integrating network hardware into AI serving runtimes for improved latency and resource utilization.

Abstract

The increasing prominence of AI necessitates the deployment of inference platforms for efficient and effective management of AI pipelines and compute resources. As these pipelines grow in complexity, the demand for distributed serving rises and introduces much-dreaded network delays. In this paper, we investigate how the network can instead help relieve the excessively high resource overheads of AI pipelines. To that end, we discuss how resource-intensive data processing tasks -- a key facet of growing AI pipeline complexity -- are well-matched to the computational characteristics of packet processing pipelines and how they can be offloaded onto SmartNICs. We explore the challenges and opportunities of offloading, and propose a research agenda for integrating network hardware into AI pipelines, unlocking new opportunities for optimization.

Paper Structure

This paper contains 14 sections, 4 equations.