TokenMixer-Large: Scaling Up Large Ranking Models in Industrial Recommenders

Yuchen Jiang, Jie Zhu, Xintian Han, Hui Lu, Kunmin Bai, Mingyu Yang, Shikang Wu, Ruihao Zhang, Wenlin Zhao, Shipeng Bai, Sijin Zhou, Huizhi Yang, Tianyi Liu, Wenda Liu, Ziyan Gong, Haoran Ding, Zheng Chai, Deping Xie, Zhe Chen, Yuchao Zheng, Peng Xu

TL;DR

TokenMixer-Large addresses scaling challenges in industrial recommender systems by redesigning TokenMixer to support deep architectures with stable gradient flow via mixing–reverting, interval residuals, auxiliary loss, and Sparse-Pertoken MoE. It introduces tokenization with semantically grouped tokens and a global token, a TokenMixer-Large block with Pertoken SwiGLU, and high-performance Token Parallel plus FP8 quantization. The model scales to 15B parameters offline and 7B online, with offline and online gains across Douyin's feed ads, e-commerce, and live streaming. Empirical results show improvements in AUC and real-world business metrics (orders, GMV, ADSS), validating hardware-aware co-design for practical large-scale deployment.

Abstract

While scaling laws for recommendation models have gained significant traction, existing architectures such as Wukong, HiFormer, and DHEN often struggle with sub-optimal designs and hardware under-utilization, limiting their practical scalability. Our previous TokenMixer architecture (introduced in the RankMixer paper) addressed effectiveness and efficiency by replacing self-attention with a lightweight token-mixing operator; however, it faced critical bottlenecks in deeper configurations, including sub-optimal residual paths, vanishing gradients, incomplete MoE sparsification, and constrained scalability. In this paper, we propose TokenMixer-Large, a systematically evolved architecture designed for extreme-scale recommendation. By introducing a mixing-and-reverting operation, inter-layer residuals, and an auxiliary loss, we ensure stable gradient propagation even as model depth increases. Furthermore, we incorporate a Sparse Per-token MoE to enable efficient parameter expansion. TokenMixer-Large scales to 7 billion parameters on online traffic and 15 billion parameters in offline experiments. Currently deployed in multiple scenarios at ByteDance, TokenMixer-Large has achieved significant offline and online performance gains: +1.66% in orders and +2.98% in per-capita preview payment GMV for e-commerce, +2.0% in ADSS for advertising, and +1.4% revenue growth for live streaming.
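
The abstract names a mixing-and-reverting operation but does not define it. As a rough illustration only, the sketch below assumes the mixing step is a parameter-free regrouping: each token is split into equal sub-vectors ("heads"), and the h-th sub-vector of every token is concatenated into one mixed token. Reverting is then the inverse regrouping, which restores the original token layout so that a residual connection can line up with the block input. The function names `mix` and `revert`, the list-of-lists layout, and the equal-split assumption are all hypothetical; the paper's actual operator may differ.

```python
def mix(tokens, num_heads):
    """Regroup T tokens of width D into num_heads mixed tokens of width T * D / num_heads.

    Assumption (not from the source): mixing is a pure reshape/transpose with no
    learned parameters, and D is divisible by num_heads.
    """
    width = len(tokens[0])
    if width % num_heads != 0:
        raise ValueError("token width must be divisible by num_heads")
    d = width // num_heads
    # Mixed token h collects slice h of every original token, in token order.
    return [
        [v for tok in tokens for v in tok[h * d:(h + 1) * d]]
        for h in range(num_heads)
    ]


def revert(mixed, num_tokens):
    """Inverse of mix: rebuild the original num_tokens tokens from the mixed layout."""
    mixed_width = len(mixed[0])
    if mixed_width % num_tokens != 0:
        raise ValueError("mixed width must be divisible by num_tokens")
    d = mixed_width // num_tokens
    # Original token t collects slice t of every mixed token, in head order.
    return [
        [v for head in mixed for v in head[t * d:(t + 1) * d]]
        for t in range(num_tokens)
    ]


# Two tokens of width 4, split into two heads of width 2.
tokens = [[1, 2, 3, 4], [5, 6, 7, 8]]
mixed = mix(tokens, num_heads=2)
restored = revert(mixed, num_tokens=2)
print(mixed)     # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(restored)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Under this assumption, reverting exactly undoes mixing, so the output of a block has the same token semantics as its input. In a real block, per-token layers would sit between the two calls; they are omitted here.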

Paper Structure

This paper contains 50 sections, 15 equations, 8 figures, and 14 tables.

Figures (8)

  • Figure 1: The architecture of TokenMixer-Large. Raw tokens include all original features as well as features from sequence aggregation and extraction (such as DIN/LONGER). The entire TokenMixer-Large model consists of multiple TokenMixer-Large blocks, and the backbone of each block consists of (Norm, Mixing, S-P MoE, Reverting, Norm, S-P MoE) and residuals.
  • Figure 2: Internal Residual and Auxiliary Loss
  • Figure 3: Workflow of high-performance operators in one block. Green nodes represent operators, blue nodes represent data. The asterisk (*) indicates that the data is stored and computed in FP8 quantization.
  • Figure 4: Scaling Laws on different scenarios
  • Figure 5: Scaling laws between AUC-gain and Params/Flops of various SOTA models. The x-axis uses a logarithmic scale.
  • ...and 3 more figures