SCRec: A Scalable Computational Storage System with Statistical Sharding and Tensor-train Decomposition for Recommendation Models
Jinho Yang, Ji-Hoon Kim, Joo-Young Kim
TL;DR
SCRec tackles the TB-scale embedding and bandwidth challenges of deep learning recommendation models by pairing a software framework, built around a mixed-integer programming cost model, with hardware accelerators on SmartSSDs, and by adaptively mapping workloads onto those accelerators. It introduces three-level statistical sharding and TT-format embedding to fit massive embedding tables (EMBs) into on-chip memory and accelerate their reconstruction, all within a single server to minimize data movement. The approach combines a Data Statistic Analyzer, Scalable Resource Manager, and Address Remapper with TT-accelerated EMB and MLP cores, achieving up to 55.77x faster inference than a CPU-DRAM system with no loss in accuracy and up to 13.35x better energy efficiency than a multi-GPU system. This work offers a practical, energy-efficient path to deploying industrial-scale DLRMs with near-data processing and hardware-assisted TT-embedding on a cost-effective server.
Abstract
Deep Learning Recommendation Models (DLRMs) play a crucial role in delivering personalized content across web applications such as social networking and video streaming. However, with improvements in performance, the parameter size of DLRMs has grown to terabyte (TB) scales, accompanied by memory bandwidth demands exceeding TB/s levels. Furthermore, the workload intensity within the model varies based on the target mechanism, making it difficult to build an optimized recommendation system. In this paper, we propose SCRec, a scalable computational storage recommendation system that can handle TB-scale industrial DLRMs while meeting their high bandwidth requirements. SCRec utilizes a software framework that features a mixed-integer programming (MIP)-based cost model, efficiently fetching data based on data access patterns and adaptively configuring memory-centric and compute-centric cores. Additionally, SCRec integrates hardware acceleration cores to enhance DLRM computations, particularly allowing for the high-performance reconstruction of approximated embedding vectors from the extremely compressed tensor-train (TT) format. By combining its software framework and hardware accelerators, and by eliminating data communication overhead through implementation on a single server, SCRec achieves substantial improvements in DLRM inference performance. It delivers up to 55.77$\times$ speedup compared to a CPU-DRAM system with no loss in accuracy and up to 13.35$\times$ energy efficiency gains over a multi-GPU system.
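To make the TT-format idea concrete, the following is a minimal pure-Python sketch of how one embedding row can be reconstructed from tensor-train cores, and of why the format is so compact. It is an illustration of the general TT-embedding technique, not SCRec's accelerator design: the function names (`split_index`, `reconstruct_row`, `tt_param_count`), the nested-list core layout (input rank, row factor, column factor, output rank), and the factor sizes in the usage note are all assumptions chosen for the example.

```python
import random


def make_tt_cores(row_factors, col_factors, ranks, seed=0):
    """Build random TT cores; core k is indexed [r_in][n_k][d_k][r_out]."""
    rng = random.Random(seed)
    full_ranks = [1] + list(ranks) + [1]  # boundary ranks are fixed to 1
    cores = []
    for k, (n, d) in enumerate(zip(row_factors, col_factors)):
        r_in, r_out = full_ranks[k], full_ranks[k + 1]
        cores.append([[[[rng.uniform(-1.0, 1.0) for _ in range(r_out)]
                        for _ in range(d)]
                       for _ in range(n)]
                      for _ in range(r_in)])
    return cores


def split_index(index, row_factors):
    """Split a flat row index into mixed-radix digits, most significant first."""
    digits = []
    for n in reversed(row_factors):
        digits.append(index % n)
        index //= n
    return digits[::-1]


def reconstruct_row(cores, row_factors, index):
    """Rebuild one embedding row by chaining small matrix products over the cores."""
    digits = split_index(index, row_factors)
    # partial[j][a]: j runs over the columns built so far, a over the open TT rank.
    partial = [[1.0]]
    for core, i_k in zip(cores, digits):
        r_in = len(core)
        d = len(core[0][i_k])
        r_out = len(core[0][i_k][0])
        new_partial = []
        for row in partial:
            for c in range(d):
                new_partial.append([
                    sum(row[a] * core[a][i_k][c][b] for a in range(r_in))
                    for b in range(r_out)
                ])
        partial = new_partial
    # The last rank is 1, so each entry holds exactly one value.
    return [row[0] for row in partial]


def tt_param_count(row_factors, col_factors, ranks):
    """Number of parameters stored across all TT cores."""
    full_ranks = [1] + list(ranks) + [1]
    return sum(full_ranks[k] * n * d * full_ranks[k + 1]
               for k, (n, d) in enumerate(zip(row_factors, col_factors)))


def dense_param_count(row_factors, col_factors):
    """Number of parameters in the equivalent uncompressed table."""
    rows, cols = 1, 1
    for n in row_factors:
        rows *= n
    for d in col_factors:
        cols *= d
    return rows * cols
```

As a usage example under these assumed sizes, a table with 200 x 200 x 250 = 10,000,000 rows and 4 x 4 x 8 = 128 columns holds 1.28 billion dense parameters, while `tt_param_count((200, 200, 250), (4, 4, 8), (16, 16))` gives 249,600. The trade-off is that every lookup must run the chain of small products in `reconstruct_row`, which is the step the abstract describes accelerating in hardware.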
