PIFS-Rec: Process-In-Fabric-Switch for Large-Scale Recommendation System Inferences

Pingyi Huo; Anusha Devulapally; Hasan Al Maruf; Minseo Park; Krishnakumar Nair; Meena Arunachalam; Gulsum Gudukbay Akbulut; Mahmut Taylan Kandemir; Vijaykrishnan Narayanan

TL;DR

This work addresses the bandwidth bottlenecks of large-scale DLRMs by embedding near-data processing inside a CXL fabric switch through the PIFS-Rec architecture. It combines hardware innovations (a lightweight processing core, on-switch buffer, and accelerated accumulation) with software strategies (page-granular management, global hotness detection, and embedding spreading) to exploit memory bandwidth across multi-node CXL fabrics. Key results show latency reductions of $3.89\times$ over Pond and $2.03\times$ over BEACON, along with favorable TCO and energy efficiency, demonstrating strong potential for scalable, memory-bandwidth-bound inference workloads. The approach offers broad applicability beyond DLRMs by providing a near-data compute framework tightly integrated with CXL fabric switches, enabling scalable acceleration for embedding-heavy models in datacenters.

Abstract

Deep Learning Recommendation Models (DLRMs) have become increasingly prevalent in today's datacenters, consuming most of the AI inference cycles. DLRM performance is heavily influenced by available bandwidth because of the large vectors in their embedding tables and the concurrent accesses to them. Achieving substantial improvements over existing solutions requires novel approaches to DLRM optimization, especially in the context of emerging interconnect technologies such as CXL. This study explores CXL-enabled systems and implements a process-in-fabric-switch (PIFS) solution to accelerate DLRMs while optimizing their memory and bandwidth scalability. We present an in-depth characterization of industry-scale DLRM workloads running on CXL-ready systems and identify the predominant bottlenecks in existing CXL systems. Based on these findings, we propose PIFS-Rec, a PIFS-based scheme that implements near-data processing through the downstream ports of the fabric switch. PIFS-Rec achieves 3.89x lower latency than Pond, an industry-standard CXL-based system, and outperforms BEACON, a state-of-the-art scheme, by 2.03x.
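To make the bandwidth argument concrete, the following is a minimal sketch of the embedding lookup with sum pooling that dominates DLRM inference. It is an illustration only, not the PIFS-Rec implementation: the names `sum_pool_lookup` and `bytes_gathered`, the toy table, and the 4-byte element size are all assumptions introduced here. The point it illustrates is that every index pulls a full embedding row from memory, while only one pooled vector is returned.

```python
from typing import List


def sum_pool_lookup(table: List[List[float]], indices: List[int]) -> List[float]:
    """Gather the rows named by `indices` and accumulate them element-wise.

    Hypothetical helper for illustration; a real DLRM would use a tensor library.
    """
    dim = len(table[0])
    pooled = [0.0] * dim
    for idx in indices:
        row = table[idx]  # one full embedding row is read per index
        for d in range(dim):
            pooled[d] += row[d]  # accumulation: one add per element read
    return pooled


def bytes_gathered(num_indices: int, dim: int, bytes_per_element: int = 4) -> int:
    """Bytes read from the embedding table for one pooled lookup.

    Assumes 4-byte (fp32) elements; this is an assumption, not a figure from the paper.
    """
    return num_indices * dim * bytes_per_element


# Toy table: 8 rows, 4 dimensions; row r holds [10r, 10r+1, 10r+2, 10r+3].
table = [[float(r * 10 + c) for c in range(4)] for r in range(8)]
indices = [1, 3, 3, 6]

pooled = sum_pool_lookup(table, indices)
traffic = bytes_gathered(len(indices), dim=4)
result_bytes = bytes_gathered(1, dim=4)

print(pooled)                 # pooled vector returned to the host
print(traffic, result_bytes)  # bytes gathered vs. bytes in the result
```

In this toy case four rows are gathered to produce a single row of output, so performing the accumulation near the memory devices would move only the pooled result toward the host. Real embedding tables and pooling factors are far larger than this example.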

Paper Structure (38 sections, 18 figures, 3 tables)

Introduction
Background and Related Works
Deep Learning Recommendation Model (DLRM)
CXL Overview
Conventional CXL
Fabric Switch
Related Works
Characterization Study and Motivation
System Design
Hardware Architecture
System Overview
Process Flow
Instruction Modification
On-Switch Buffer
Out-of-Order Accumulation
...and 23 more sections

Figures (18)

Figure 1: End-to-end DLRM pipeline for inference.
Figure 2: Architecture of a CXL-based system. The devices use Flexbus to communicate with the host. The fabric manager configures the Virtual PCI-to-PCI Bridge (VPPB) to control the FM endpoints in the fabric switches. These switches connect all devices within the system. Data leaves the fabric switch through the PCI-to-PCI Bridge (PPB).
Figure 3: A simplified illustration of the production-ready CXL-enabled experiment platform.
Figure 4: (a) Batch Threading -- each batch is assigned to a CPU core to be processed. (b) Table Threading -- each embedding table is accessed by a CPU core to be processed.
Figure 5: The X-axis indicates the embedding table size and the Y-axis indicates the normalized application bandwidth. (a)-(b) Adding CPU sockets can address the scale-up issue of memory-bound embedding table lookup operations, but at the cost of high performance overhead. (c)-(d) CXL memory can provide better performance than remote CPU sockets. However, simply replacing CPU-attached memory with CXL memory causes performance overheads during high memory traffic over CXL. (e)-(f) Software interleaving during page allocation improves performance through CXL’s bandwidth expansion.
...and 13 more figures
