Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

Kayhan Behdin, Ata Fatahibaarzi, Qingquan Song, Yun Dai, Aman Gupta, Zhipeng Wang, Shao Tang, Hejian Sang, Gregory Dexter, Sirou Zhu, Siyu Zhu, Tejas Dharamsi, Vignesh Kothapalli, Zhoutong Fu, Yihan Cao, Pin-Lun Hsu, Fedor Borisyuk, Natesh Pillai, Luke Simon, Rahul Mazumder

TL;DR

This work addresses the challenge of deploying large language models (LLMs) in recommendation systems by presenting a pipeline that scales LLMs down into efficient small language models (SLMs) using knowledge distillation and post-training compression. It demonstrates a two-stage distillation strategy, coupled with OSSCAR-based structured pruning and optional quantization, to obtain SLMs that maintain near-parity with much larger foundation models while delivering substantial latency and cost benefits. The paper provides detailed deployment lessons, including hardware optimizations, KV caching, and FP8/quantization strategies, validated on predictive ranking and reasoning tasks in a large-scale RecSys setting. Together, these results support real-time, scalable, and cost-effective use of LLM capabilities on a large professional social network platform, with concrete guidance for practitioners on training, pruning, quantization, and serving at scale.

Abstract

Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in Recommendation Systems.
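
To make the first of the two techniques concrete, the following is a minimal sketch of a token-level distillation loss. It assumes the classic temperature-scaled forward KL divergence between teacher and student output distributions; the abstract does not state the exact loss the authors use, so the function names, the temperature value, and the choice of forward KL are illustrative assumptions and not the paper's method.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Forward KL(teacher || student) for one token position, scaled by T^2.

    Assumption: this is the textbook soft-label loss, used here only to
    illustrate how a student is pulled toward the teacher's distribution.
    """
    p = softmax(teacher_logits, temperature)  # teacher soft labels
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(distillation_loss(teacher, teacher))          # identical logits give zero loss
    print(distillation_loss([3.0, 2.0, 1.0], teacher))  # mismatched logits give a positive loss
```

In practice such a term would be averaged over all token positions in a batch and is often mixed with the ordinary supervised loss; the mixing weight is another hyperparameter not specified here.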

Paper Structure

This paper contains 22 sections, 4 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Overview of the process of creating SLMs via distillation and compression.
  • Figure 2: Comparison of Distillation and SFT on the Foundation Model. Knowledge distillation consistently outperforms SFT by effectively leveraging teacher supervision to preserve and enhance performance.
  • Figure 3: P99 time to first token (TTFT, in ms) for various LLMs.
  • Figure 4: Latency breakdown of a single Transformer block for pruned and unpruned models. At longer context sizes, attention is a bottleneck.
  • Figure 5: Comparison of one-shot pruning methods. The bars indicate the drop (in percentage points) relative to the full precision baseline. The pruned model is a 6.4B model (20% MLP pruning).