Compute Only Once: UG-Separation for Efficient Large Recommendation Models

Hui Lu, Zheng Chai, Shipeng Bai, Hao Zhang, Zhifang Fan, Kunmin Bai, Yingwen Wu, Bingzheng Wei, Xiang Sun, Ziyan Gong, Tianyi Liu, Hua Chen, Deping Xie, Zhongkai Chen, Zhiliang Guo, Qiwei Chen, Yuchao Zheng

TL;DR

Recommender systems increasingly rely on large-scale dense interaction models, which incur prohibitive training and inference costs as feature interactions grow. The paper introduces UG-Sep, a masking-based User-Group Separation framework that disentangles user-side and item-side information flows. This lets user-side computation be reused through a reusable PertokenFFN, while an information compensation step recovers the interactions that masking suppresses. To further speed up serving, the paper applies W8A16 weight-only quantization to alleviate memory bandwidth bottlenecks. Offline and online experiments on ByteDance platforms show up to a 20% latency reduction with negligible changes in AUC and business metrics, supporting UG-Sep as a practical approach for scalable dense recommender models.
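
As a rough illustration of what W8A16 weight-only quantization means, the sketch below stores a weight matrix as int8 with one scale per output channel and dequantizes it to 16-bit floats before multiplying with 16-bit activations. It is a minimal NumPy sketch under stated assumptions, not the paper's serving kernel: symmetric per-channel scaling is an assumption, and the helper names `quantize_weights_int8` and `w8a16_linear` are hypothetical.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-output-channel int8 quantization (assumed scheme).

    w has shape (in_dim, out_dim); one scale is kept per output column.
    """
    max_abs = np.abs(w).max(axis=0, keepdims=True)
    # Guard against all-zero columns so the division below is always defined.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def w8a16_linear(x_fp16: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize int8 weights to float16, then multiply with float16 activations."""
    w_fp16 = (q.astype(np.float32) * scale).astype(np.float16)
    return x_fp16 @ w_fp16

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 32)).astype(np.float32)   # hypothetical FFN weight
x = rng.normal(size=(8, 64)).astype(np.float16)    # 16-bit activations

q, scale = quantize_weights_int8(w)
y = w8a16_linear(x, q, scale)

# The int8 weights occupy half the bytes of a float16 copy of the same matrix.
bytes_int8 = q.nbytes
bytes_fp16 = w.astype(np.float16).nbytes
```

Only the weights are compressed here; activations stay in 16-bit precision. Halving the bytes read per weight is what would help a memory-bound layer, which is the bottleneck the summary says the quantization targets.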

Abstract

Driven by scaling laws, recommender systems increasingly rely on large-scale models to capture complex feature interactions and user behaviors, but this trend also leads to prohibitive training and inference costs. While long-sequence models (e.g., LONGER) can reuse user-side computation through KV caching, such reuse is difficult in dense feature interaction architectures (e.g., RankMixer), where user and group (candidate item) features are deeply entangled across layers. In this work, we propose User-Group Separation (UG-Sep), a novel framework that enables reusable user-side computation in dense interaction models for the first time. UG-Sep introduces a masking mechanism that explicitly disentangles user-side and item-side information flows within token-mixing layers, ensuring that a subset of tokens preserves purely user-side representations across layers. This design enables the corresponding token computations to be reused across multiple samples, significantly reducing redundant inference cost. To compensate for the potential loss of expressiveness induced by masking, we further propose an Information Compensation strategy that adaptively reconstructs suppressed user-item interactions. Moreover, as UG-Sep substantially reduces user-side FLOPs and exposes memory-bound components, we incorporate W8A16 (8-bit weight, 16-bit activation) weight-only quantization to alleviate memory bandwidth bottlenecks and achieve additional acceleration. We conduct extensive offline evaluations and large-scale online A/B experiments at ByteDance, demonstrating that UG-Sep reduces inference latency by up to 20 percent without degrading online user experience or commercial metrics across multiple business scenarios, including feed recommendation and advertising systems.
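
The masking idea in the abstract can be sketched with a toy token-mixing layer. The example below assumes a generic learned mixing matrix over tokens as a stand-in for the real token-mixing layer, and a block mask that stops user-token outputs from reading group (candidate item) tokens. The names `ug_mask` and `masked_token_mix`, the tensor shapes, and the mixing-matrix form are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ug_mask(num_user: int, num_group: int) -> np.ndarray:
    """mask[i, j] = 1 if output token i may read input token j."""
    total = num_user + num_group
    mask = np.ones((total, total), dtype=np.float32)
    # User-token rows must not read group-token columns.
    mask[:num_user, num_user:] = 0.0
    return mask

def masked_token_mix(x: np.ndarray, mix_weight: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """x: (batch, tokens, dim); mix_weight and mask: (tokens, tokens)."""
    return np.einsum("ij,bjd->bid", mix_weight * mask, x)

rng = np.random.default_rng(0)
num_user, num_group, dim, num_candidates = 3, 2, 4, 5
total = num_user + num_group

mix_weight = rng.normal(size=(total, total)).astype(np.float32)
mask = ug_mask(num_user, num_group)

# One request: the same user tokens paired with several candidate items.
user_tokens = rng.normal(size=(1, num_user, dim))
item_tokens = rng.normal(size=(num_candidates, num_group, dim))
x = np.concatenate(
    [np.repeat(user_tokens, num_candidates, axis=0), item_tokens], axis=1
)

# Baseline: run the masked layer on every (user, candidate) sample.
full = masked_token_mix(x, mix_weight, mask)

# Reuse: compute the user-token outputs once from the user-only block.
user_once = np.einsum(
    "ij,bjd->bid", mix_weight[:num_user, :num_user], user_tokens
)

user_outputs_match = np.allclose(
    full[:, :num_user],
    np.broadcast_to(user_once, (num_candidates, num_user, dim)),
    atol=1e-5,
)
```

Because the masked rows never read item tokens, the user-token outputs are identical for every candidate in the request, so they can be computed once and shared. Group-token rows still read user tokens, which is why only the user-side part of the layer becomes reusable in this sketch.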

Paper Structure

This paper contains 26 sections, 15 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: TokenMixer-Style Layer with UG-Sep
  • Figure 2: UG-Sep with Separated Residual
  • Figure 3: Information Compensation
  • Figure 4: Attention with UG Mask