Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design

Bojian Hou, Xiaolong Liu, Xiaoyi Liu, Jiaqi Xu, Yasmine Badr, Mengyue Hang, Sudhanshu Chanpuriya, Junqing Zhou, Yuhang Yang, Han Xu, Qiuling Suo, Laming Chen, Yuxi Hu, Jiasheng Zhang, Huaqing Xiong, Yuzhen Huang, Chao Chen, Yue Dong, Yi Yang, Shuo Chang, Xiaorui Gan, Wenlin Chen, Santanu Kolay, Darren Liu, Jade Nie, Chunzhi Yang, Jiyan Yang, Huayu Li

TL;DR

Kunlun addresses the challenge of establishing predictable scaling laws for massive-scale recommender systems that jointly model sequential user behavior and heterogeneous context features. It achieves this through a two-level model-efficiency co-design: low-level optimizations (Generalized Dot-Product Attention, Hierarchical Seed Pooling, Sliding Window Attention) and high-level computation reallocation (Computation Skip, Event-Level Personalization, Mixture of Wukong Experts). The approach yields substantial efficiency and scaling gains, elevating Model FLOPs Utilization from 17% to 37% on NVIDIA B200 GPUs and delivering approximately 2× scaling efficiency over prior methods, with production deployment in Meta Ads showing measurable topline impact. By demonstrating predictable power-law scaling for joint sequence-context modeling and validating it with large-scale experiments and production results, the work provides a practical framework for scaling outbound CTR models at industrial scale.
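To make "predictable power-law scaling" concrete, the following is a minimal sketch of how a power law of the form gain = a * compute^b can be fitted by linear regression in log-log space. The function name `fit_power_law` and the data points are illustrative assumptions (the points are synthetic, generated from a known curve); they are not measurements from the paper.

```python
import math

def fit_power_law(compute, gain):
    """Least-squares fit of gain = a * compute**b, done in log-log space.

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(g) for g in gain]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)   # intercept in log-log space = log(a)
    return a, b

# Synthetic points generated from gain = 0.2 * compute**0.3 (assumed values).
compute_gflops = [1.0, 10.0, 100.0, 1000.0]
gain = [0.2 * c ** 0.3 for c in compute_gflops]

a, b = fit_power_law(compute_gflops, gain)
print(f"a={a:.3f}, b={b:.3f}")
```

On noise-free synthetic data the fit recovers the generating coefficients. With real measurements, the exponent b is the quantity one would compare across architectures, since a larger b means more gain per unit of added compute.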

Abstract

Deriving predictable scaling laws that govern the relationship between model performance and computational investment is crucial for designing and allocating resources in massive-scale recommendation systems. While such laws are established for large language models, they remain challenging for recommendation systems, especially those processing both user history and context features. We identify poor scaling efficiency as the main barrier to predictable power-law scaling, stemming from inefficient modules with low Model FLOPs Utilization (MFU) and suboptimal resource allocation. We introduce Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation. Our low-level optimizations include Generalized Dot-Product Attention (GDPA), Hierarchical Seed Pooling (HSP), and Sliding Window Attention. Our high-level innovations feature Computation Skip (CompSkip) and Event-level Personalization. These advances increase MFU from 17% to 37% on NVIDIA B200 GPUs and double scaling efficiency over state-of-the-art methods. Kunlun is now deployed in major Meta Ads models, delivering significant production impact.
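For readers unfamiliar with MFU, it is commonly computed as the model FLOPs actually executed per second divided by the hardware's theoretical peak FLOPs per second. The sketch below illustrates that arithmetic. All inputs (FLOPs per step, step time, GPU count, and especially the per-GPU peak) are placeholder assumptions chosen to land on 37%; the peak value is not the B200 specification, and the function name is hypothetical.

```python
def model_flops_utilization(model_flops_per_step, step_time_s,
                            num_gpus, peak_flops_per_gpu):
    """MFU = achieved model FLOP/s divided by aggregate peak FLOP/s."""
    achieved_flops_per_s = model_flops_per_step / step_time_s
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s

# Placeholder numbers for illustration only (not from the paper).
mfu = model_flops_utilization(
    model_flops_per_step=1.48e15,  # assumed FLOPs needed for one training step
    step_time_s=0.5,               # assumed wall-clock time per step
    num_gpus=8,                    # assumed GPU count
    peak_flops_per_gpu=1e15,       # assumed peak; NOT the real B200 figure
)
print(f"MFU = {mfu:.0%}")

# If the model FLOPs per step stay fixed, raising MFU from 17% to 37%
# shortens the step time by the ratio of the two utilizations.
speedup = 0.37 / 0.17
print(f"step-time speedup at equal FLOPs: {speedup:.2f}x")
```

The second calculation assumes the FLOPs per step are unchanged between the two configurations, which need not hold when the architecture itself changes.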

Paper Structure

This paper contains 44 sections, 26 equations, 4 figures, 3 tables, 4 algorithms.

Figures (4)

  • Figure 1: Overview of the Kunlun architecture. The model is composed of multiple stacked layers, and each layer includes two main components: (1) a Kunlun Transformer block, which incorporates GDPA-enhanced PFFN and Multi-Head Self-Attention (MHA) to enable context-aware sequence modeling; and (2) a Kunlun Interaction block, which contains a Weight Generation module (to derive personalized weights for the PFFN from non-sequential features), an HSP module (to efficiently summarize sequential information for subsequent global interaction), and a Global Interaction module that facilitates interactions between sequential and non-sequential inputs, as well as interactions within the non-sequential features themselves.
  • Figure 2: Comparison between (a) the original PFFN, and (b) our GDPA-enhanced PFFN. Note: Both are one-block demos.
  • Figure 3: NE gains of different architectures compared to Wukong baseline across computational scales (6, 60, and 180 GFLOPs). The y-axis shows the absolute NE gains, where lower NE indicates better model performance. Larger values represent greater improvement over Wukong. Kunlun achieves the largest NE gains at all scales (0.31%, 0.66%, 0.79%), with the performance gap widening as computational budget increases, demonstrating superior scaling efficiency. Note: NEs are not comparable across scales due to different feature configurations; within-scale results use identical setups.
  • Figure 4: (Left) Scaling law curves showing NE improvement vs. Total Compute. Kunlun achieves steeper scaling (2× efficiency over state-of-the-art) with predictable power-law behavior. (Right) NE improvement as a function of the number of layers (1-6). Each additional layer provides diminishing but predictable returns following logarithmic scaling.
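The "diminishing but predictable returns" described for Figure 4 (Right) can be illustrated with a logarithmic fit of the form gain = a + b * ln(layers). The sketch below is only an illustration under stated assumptions: the helper `fit_log_curve`, the coefficients, and the data are synthetic and hypothetical, not values read from the figure.

```python
import math

def fit_log_curve(layers, gain):
    """Least-squares fit of gain = a + b * ln(layers).

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.log(n_layers) for n_layers in layers]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(gain) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, gain))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic gains for 1 to 6 layers, generated from assumed a=0.10, b=0.25.
layers = [1, 2, 3, 4, 5, 6]
gain = [0.10 + 0.25 * math.log(n_layers) for n_layers in layers]

a, b = fit_log_curve(layers, gain)

# Under a log curve, going from L to L+1 layers adds b * ln((L+1)/L),
# which shrinks as L grows.
marginal = [b * math.log((n_layers + 1) / n_layers) for n_layers in layers[:-1]]
for n_layers, m in zip(layers[:-1], marginal):
    print(f"{n_layers} -> {n_layers + 1} layers: +{m:.3f}")
```

Each added layer contributes less than the previous one, yet the size of the next increment follows directly from the fitted b, which is what makes the returns predictable.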