MSN: A Memory-based Sparse Activation Scaling Framework for Large-scale Industrial Recommendation

Shikang Wu, Hui Lu, Jinqiu Jin, Zheng Chai, Shiyong Hong, Junjie Zhang, Shanlei Mu, Kaiyuan Ma, Tianyi Liu, Yuchao Zheng, Zhe Wang, Jingjian Lin

TL;DR

MSN tackles the efficiency-personalization trade-off in scaling DLRMs for industrial recommendation. Instead of relying on a small number of large shared experts, it uses a large parameterized memory $\mathcal{V}$ queried through Product-Key Memory (PKM) retrieval, which reduces retrieval cost from $O(nd)$ to $O(\sqrt{n}\,d)$. A memory gating mechanism then fuses the retrieved values into downstream feature interactions, while layer normalization and over-parameterization keep optimization stable; the Sparse-Gather and AirTopK operators improve training and inference efficiency. In offline experiments on Douyin data and in online A/B tests, MSN yields consistent QAUC gains and improvements in engagement metrics, validating the approach in a billion-scale system. The work provides a scalable, plug-in memory-based scaling paradigm for large-scale recommender systems and suggests future directions that combine memory with other sparsity strategies.
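
As a rough illustration of the gap between the two costs, the worked arithmetic below assumes the standard product-key construction: two sub-key tables of $\sqrt{n}$ entries each, scored against half-queries of dimension $d/2$. The values $n = 10^6$ and $d = 64$ are made up for the example and are not figures from the paper, and the small extra cost of merging the top candidates from the two tables is omitted.

$$
\underbrace{n \cdot d}_{\text{flat key scoring}} = 10^{6} \cdot 64 = 6.4 \times 10^{7},
\qquad
\underbrace{2 \cdot \sqrt{n} \cdot \tfrac{d}{2}}_{\text{product-key scoring}} = 2 \cdot 10^{3} \cdot 32 = 6.4 \times 10^{4}.
$$

Under these assumptions the scoring cost shrinks by a factor of $\sqrt{n} = 10^{3}$.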

Abstract

Scaling deep learning recommendation models is an effective way to improve model expressiveness. However, existing approaches often incur substantial computational overhead, making them difficult to deploy in large-scale industrial systems under strict latency constraints. Recent sparse activation scaling methods, such as Sparse Mixture-of-Experts, reduce computation by activating only a subset of parameters, but they still suffer from high memory access costs and limited personalization capacity because the experts are large in size and few in number. To address these challenges, we propose MSN, a memory-based sparse activation scaling framework for recommendation models. MSN dynamically retrieves personalized representations from a large parameterized memory and integrates them into downstream feature interaction modules via a memory gating mechanism, enabling fine-grained personalization with low computational overhead. To allow the memory capacity to grow further while keeping both computational and memory access costs under control, MSN adopts a Product-Key Memory (PKM) mechanism, which reduces the memory retrieval complexity from linear to sub-linear. In addition, normalization and over-parameterization techniques are introduced to maintain balanced memory utilization and prevent memory retrieval collapse. We further design a customized Sparse-Gather operator and adopt the AirTopK operator to improve training and inference efficiency in industrial settings. Extensive experiments demonstrate that MSN consistently improves recommendation performance while maintaining high efficiency. Moreover, MSN has been successfully deployed in the Douyin Search Ranking System, achieving significant gains over deployed state-of-the-art models in both offline evaluation metrics and large-scale online A/B tests.
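
The sub-linear retrieval described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the generic product-key memory lookup, not MSN's actual implementation: the function name `pkm_retrieve`, all argument names, the slot layout `i * m + j`, and the softmax weighting of the retrieved values are assumptions made for the example. `np.argsort` stands in for a dedicated top-k kernel such as AirTopK, and plain array indexing stands in for the Sparse-Gather operator.

```python
import numpy as np


def pkm_retrieve(query, sub_keys_1, sub_keys_2, values, k):
    """Product-key top-k lookup (illustrative sketch, hypothetical names).

    query:      (d,)          query vector, d must be even
    sub_keys_1: (m, d // 2)   sub-keys matched against the first half of the query
    sub_keys_2: (m, d // 2)   sub-keys matched against the second half of the query
    values:     (m * m, d_v)  memory values; slot i * m + j pairs sub-key i with sub-key j
    """
    m = sub_keys_1.shape[0]
    half = query.shape[0] // 2
    q1, q2 = query[:half], query[half:]

    # Score each half-query against its own m sub-keys: 2 * m * (d / 2) multiply-adds
    # instead of m * m * d for a flat key table with n = m * m entries.
    s1 = sub_keys_1 @ q1  # (m,)
    s2 = sub_keys_2 @ q2  # (m,)

    # Keep the k best sub-keys per half (argsort is a stand-in for a top-k kernel).
    top1 = np.argsort(-s1)[:k]
    top2 = np.argsort(-s2)[:k]

    # The overall top-k slots are guaranteed to lie in this k x k candidate grid,
    # because a full key's score is the sum of its two sub-key scores.
    cand_scores = s1[top1][:, None] + s2[top2][None, :]  # (k, k)
    cand_slots = top1[:, None] * m + top2[None, :]       # (k, k)

    best = np.argsort(-cand_scores.ravel())[:k]
    slots = cand_slots.ravel()[best]
    scores = cand_scores.ravel()[best]

    # Assumption: softmax-weighted sum of the k gathered values.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return slots, weights @ values[slots]


rng = np.random.default_rng(0)
m, d, d_v, k = 16, 8, 4, 3
sub_keys_1 = rng.normal(size=(m, d // 2))
sub_keys_2 = rng.normal(size=(m, d // 2))
values = rng.normal(size=(m * m, d_v))
query = rng.normal(size=d)

slots, retrieved = pkm_retrieve(query, sub_keys_1, sub_keys_2, values, k)
```

Only `2 * m` sub-keys are scored and only `k` rows of `values` are read, which is where the reduction in both computation and memory access comes from when the memory holds `m * m` slots.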

Paper Structure

This paper contains 28 sections, 13 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overall framework of MSN.
  • Figure 2: Comparison between a 2-layer FFN and MSN.
  • Figure 3: The overall architecture of the backbone model.
  • Figure 4: The distribution of activated memory values.