SORT: A Systematically Optimized Ranking Transformer for Industrial-scale Recommenders

Chunqi Wang; Bingchao Wu; Taotian Pang; Jiahao Wang; Jie Yang; Jia Liu; Hao Zhang; Hai Zhu; Lei Shen; Shizhun Wang; Bing Wang; Xiaoyi Zeng

TL;DR

This paper proposes SORT (Systematically Optimized Ranking Transformer), a scalable model designed to bridge the gap between Transformers and industrial-scale ranking models. It introduces a suite of refinements to the tokenization, multi-head attention, and feed-forward network modules, which together stabilize training and enlarge model capacity.

Abstract

While Transformers have achieved remarkable success in LLMs through superior scalability, their application to industrial-scale ranking models remains nascent, hindered by high feature sparsity and low label density. In this paper, we propose SORT (Systematically Optimized Ranking Transformer), a scalable model designed to bridge the gap between Transformers and industrial-scale ranking models. We address the feature sparsity and label density challenges through a series of optimizations, including request-centric sample organization, local attention, query pruning, and generative pre-training. Furthermore, we introduce a suite of refinements to the tokenization, multi-head attention (MHA), and feed-forward network (FFN) modules, which collectively stabilize the training process and enlarge the model capacity. To maximize hardware efficiency, we optimize our training system to raise the model FLOPs utilization (MFU) to 22%. Extensive experiments demonstrate that SORT outperforms strong baselines and scales well with data size, model size, and sequence length, while remaining flexible in integrating diverse features. Finally, online A/B testing in large-scale e-commerce scenarios confirms that SORT achieves significant gains in key business metrics, including orders (+6.35%), buyers (+5.97%), and GMV (+5.47%), while cutting latency nearly in half (-44.67%) and more than doubling throughput (+121.33%).

Paper Structure (35 sections, 6 equations, 12 figures, 5 tables)

Introduction
Related Work
SORT
Preliminaries
Task
Data
Model
Tokenization
Multi-head Attention
Feed-Forward Network
Ranking Head and Loss Function
Generative Pre-training
Infrastructure
Training System
Inference System
...and 20 more sections

Figures (12)

Figure 1: Overview of SORT.
Figure 2: Training and validation AUC curves across multiple epochs with different strategies.
Figure 3: Impacts of various hyperparameter options.
Figure 4: Heatmap visualization of the impact of special token (ST) on attention logit weights. The weights are averaged over all intra-layer attention heads.
Figure 5: Scaling SORT over data size, model size and sequence length. The SORT-base model is trained on single-scenario data for one epoch, with sequence length of 1K.
...and 7 more figures
