Wukong: Towards a Scaling Law for Large-Scale Recommendation

Buyun Zhang, Liang Luo, Yuxin Chen, Jade Nie, Xi Liu, Daifeng Guo, Yanli Zhao, Shen Li, Yuchen Hao, Yantao Yao, Guna Lakshminarayanan, Ellie Dingqiao Wen, Jongsoo Park, Maxim Naumov, Wenlin Chen

TL;DR

The paper addresses the absence of a scaling law in deep learning-based recommender systems by introducing Wukong, a dense-interaction architecture built from stacked Factorization Machine Blocks (FMB) that captures arbitrary-order feature interactions using efficient low-rank projections. It demonstrates a dense upscaling strategy that yields consistent quality gains across six public datasets and a large internal dataset, outperforming state-of-the-art baselines while scaling stably across two orders of magnitude in compute and model size. Through ablation studies and comparisons to transformer-based approaches, the authors highlight the essential roles of the Factorization Machine Block (FMB), the Linear Compress Block (LCB), and residual connections, and discuss engineering strategies for large-scale training and serving. The findings suggest Wukong can serve as a scalable backbone for recommendation systems, supporting both compact deployments and large, foundation-model-like configurations with practical efficiency.

Abstract

Scaling laws play an instrumental role in the sustainable improvement of model quality. Unfortunately, recommendation models to date do not exhibit laws similar to those observed in the domain of large language models, due to the inefficiencies of their upscaling mechanisms. This limitation poses significant challenges in adapting these models to increasingly complex real-world datasets. In this paper, we propose an effective network architecture based purely on stacked factorization machines, and a synergistic upscaling strategy, collectively dubbed Wukong, to establish a scaling law in the domain of recommendation. Wukong's unique design makes it possible to capture diverse interactions of any order simply through taller and wider layers. We conducted extensive evaluations on six public datasets, and our results demonstrate that Wukong consistently outperforms state-of-the-art models quality-wise. Further, we assessed Wukong's scalability on an internal, large-scale dataset. The results show that Wukong retains its superiority in quality over state-of-the-art models, while holding the scaling law across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example, where prior art falls short.

Paper Structure

This paper contains 37 sections, 5 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Wukong outperforms existing state-of-the-art models while demonstrating a scaling law in the recommendation domain across two orders of magnitude in model complexity, extending beyond 100 GFLOP/example.
  • Figure 2: Wukong employs an interaction stack to capture feature interactions. Each layer in the stack consists of a Factorization Machine Block and a Linear Compress Block.
  • Figure 3: Scalability of Wukong with respect to the number of parameters on the internal dataset.
  • Figure 4: Significance of individual components.
  • Figure 5: Impact of scaling individual components.
  • ...and 1 more figures
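
As a rough illustration of the layer described in the Figure 2 caption, the following is a minimal NumPy sketch of one interaction-stack layer: a Factorization Machine Block and a Linear Compress Block run side by side, and their outputs are concatenated and combined with a residual connection. Everything beyond that outline is an assumption made for illustration and is not taken from this page: the class name `WukongLayerSketch`, all hyperparameter names, the single-hidden-layer MLP, the placement of layer normalization, the requirement that `n_fmb + n_lcb == n_in` so the residual shapes match, and the reading of "low-rank projections" as computing X(XᵀY) instead of (XXᵀ)Y.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize over the last axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class WukongLayerSketch:
    """Hypothetical single layer: FMB and LCB in parallel, then residual."""

    def __init__(self, n_in, d, n_fmb, n_lcb, rank, hidden, rng):
        # Assumption: the output keeps n_in embeddings so X can be added back.
        assert n_fmb + n_lcb == n_in, "sketch requires matching residual shape"
        self.n_fmb, self.d = n_fmb, d
        scale = 0.1
        self.Y = rng.normal(0.0, scale, (n_in, rank))            # low-rank projection
        self.W1 = rng.normal(0.0, scale, (n_in * rank, hidden))  # MLP layer 1
        self.W2 = rng.normal(0.0, scale, (hidden, n_fmb * d))    # MLP layer 2
        self.W_lcb = rng.normal(0.0, scale, (n_lcb, n_in))       # linear compression

    def fm(self, X):
        # Pairwise interactions X X^T projected by Y, computed as X (X^T Y)
        # so the (n_in x n_in) interaction matrix is never materialized.
        return X @ (np.swapaxes(X, 1, 2) @ self.Y)       # (batch, n_in, rank)

    def fmb(self, X):
        z = layer_norm(self.fm(X).reshape(X.shape[0], -1))
        h = np.maximum(z @ self.W1, 0.0)                 # ReLU
        return (h @ self.W2).reshape(-1, self.n_fmb, self.d)

    def lcb(self, X):
        # Linearly recombines the input embeddings; adds no interaction order.
        return self.W_lcb @ X                            # (batch, n_lcb, d)

    def __call__(self, X):
        out = np.concatenate([self.fmb(X), self.lcb(X)], axis=1)
        return layer_norm(out + X)                       # residual connection


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8, 4))  # batch of 3, 8 embeddings, each of dimension 4
layer1 = WukongLayerSketch(n_in=8, d=4, n_fmb=5, n_lcb=3, rank=2, hidden=16, rng=rng)
layer2 = WukongLayerSketch(n_in=8, d=4, n_fmb=5, n_lcb=3, rank=2, hidden=16, rng=rng)
stacked = layer2(layer1(X))     # two stacked layers
print(stacked.shape)
```

Because each layer returns a tensor of the same shape as its input under these assumptions, layers can be stacked directly, which is one way to read the abstract's "taller and wider layers": depth adds layers, while width grows `n_fmb`, `n_lcb`, `rank`, or `hidden`.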