Bending the Scaling Law Curve in Large-Scale Recommendation Systems

Qin Ding, Kevin Course, Linjian Ma, Jianhui Sun, Ruochen Liu, Zhao Zhu, Chunxing Yin, Wei Li, Dai Li, Yu Shi, Xuan Cao, Ze Yang, Han Li, Xing Liu, Bi Xue, Hongwei Li, Rui Jian, Daisy Shi He, Jing Qian, Matt Ma, Qunshu Zhang, Rui Li

TL;DR

ULTRA-HSTU is a novel sequential recommendation model, developed through end-to-end model and system co-design, that substantially improves both model quality and efficiency and is fully deployed at scale.

Abstract

Learning from user interaction history through sequential models has become a cornerstone of large-scale recommender systems. Recent advances in large language models have revealed promising scaling laws, sparking a surge of research into long-sequence modeling and deeper architectures for recommendation tasks. However, many recent approaches rely heavily on cross-attention mechanisms to address the quadratic computational bottleneck in sequential modeling, which can limit the representational power gained from self-attention. We present ULTRA-HSTU, a novel sequential recommendation model developed through end-to-end model and system co-design. By innovating in the design of input sequences, sparse attention mechanisms, and model topology, ULTRA-HSTU achieves substantial improvements in both model quality and efficiency. Comprehensive benchmarking demonstrates that ULTRA-HSTU achieves remarkable scaling efficiency gains -- over 5x faster training scaling and 21x faster inference scaling compared to conventional models -- while delivering superior recommendation quality. Our solution is fully deployed at scale, serving billions of users daily and driving significant 4% to 8% consumption and engagement improvements in real-world production environments.

Paper Structure (38 sections, 8 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Overall performance: scaling performance with respect to training (left) and inference (right) FLOPs. Compared to vanilla HSTU, ULTRA-HSTU achieves more than 5.3$\times$ the training scaling efficiency and 21.4$\times$ the inference scaling efficiency.
  • Figure 2: Model design overview: (a) General recommendation model design. (b) Input sequence optimizations with action-aware designs. (c) Semi-local attention mask with linear complexity. (d) Attention truncation for dynamic topological designs.
  • Figure 3: Attention masks. Left plot: full causal self-attention masks. Right plot: semi-local attention masks.
  • Figure 4: Mixed precision computation framework. We fuse the scaling/quantization steps with the preceding kernels.
  • Figure 5: Ablation study on scaling: training (left) and inference (middle) of SLA, inference (right) of attention truncation (AT).
  • ...and 3 more figures