Generative Early Stage Ranking

Juhee Hong, Meng Liu, Shengzhi Wang, Xiaoheng Mao, Huihui Cheng, Leon Gao, Christopher Leung, Jin Zhou, Chandra Mouli Sekar, Zhao Zhu, Ruochen Liu, Tuan Trieu, Dawei Sun, Jeet Kanjani, Rui Li, Jing Qian, Xuan Cao, Minjie Fan, Mingze Gao

TL;DR

The paper tackles the effectiveness-efficiency tradeoff in Early Stage Ranking by introducing Generative Early Stage Ranking (GESR), which enriches the traditional Two Tower ESR with a Mixture of Attention (MoA) module and Multi-Logit Parameterized Gating (MLPG). MoA combines explicit signals from Hard Matching Attention (HMA) with target-aware self-attention and cross-attention to capture nuanced user-item affinities at the ESR stage, while MLPG enhances final scoring through parallel logits and dynamic gating. The approach is backed by extensive offline and online experiments showing improved personalization and engagement metrics, with scalability optimizations including FP8 quantization, kernel optimizations, and Torch Inductor-based deployment. This work demonstrates, at industrial scale, that integrating diverse cross-signals early, coupled with efficient serving and optimization, yields meaningful performance gains without prohibitive latency, suggesting a shift in ESR design toward generative, target-aware architectures. Future directions include extending GESR to more products, leveraging longer user histories, and exploring cross-sharing between ESR and LSR stages.

Abstract

Large-scale recommendation systems commonly adopt a multi-stage cascading ranking paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the "user-item decoupling" approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these limitations, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module, which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate earlier and richer interactions between user and item features. MoA's specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we introduce a comprehensive suite of optimization techniques, ranging from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.
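
To make two of the ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It illustrates (a) raw match counts between a user's history and a candidate item, in the spirit of HMA, and (b) one possible way to fuse gated secondary logits with a primary logit, in the spirit of MLPG. The function names, the feature fields ("author", "topic"), the sigmoid gate, and the additive fusion rule are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
from typing import Dict, List, Sequence


def hard_match_counts(
    user_history: Sequence[Dict[str, str]],
    item: Dict[str, str],
    fields: Sequence[str],
) -> List[int]:
    """Count, per field, how many history events share the candidate's value.

    Hypothetical helper: the field names and event layout are illustrative.
    """
    counts = []
    for field in fields:
        target = item.get(field)
        if target is None:
            # A missing candidate value is treated as matching nothing.
            counts.append(0)
            continue
        counts.append(sum(1 for event in user_history if event.get(field) == target))
    return counts


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fuse_logits(
    primary: float,
    secondary: Sequence[float],
    gate_scores: Sequence[float],
) -> float:
    """Add sigmoid-gated secondary logits to the primary logit.

    Assumption: additive fusion with one scalar gate per secondary logit.
    The paper's actual gating and fusion may differ.
    """
    if len(secondary) != len(gate_scores):
        raise ValueError("need exactly one gate score per secondary logit")
    gated = sum(sigmoid(g) * s for g, s in zip(gate_scores, secondary))
    return primary + gated


# Toy usage: three past engagements and one candidate item.
history = [
    {"author": "a1", "topic": "cooking"},
    {"author": "a2", "topic": "cooking"},
    {"author": "a1", "topic": "travel"},
]
candidate = {"author": "a2", "topic": "cooking"}
match_counts = hard_match_counts(history, candidate, ["author", "topic"])
final_logit = fuse_logits(0.5, [1.0, -2.0], [0.0, 0.0])
```

In this toy case the candidate's author appears once in the history and its topic twice, so `match_counts` is `[1, 2]`. In a real model such counts would be fed in as explicit cross-features alongside learned attention outputs, and the gate scores would be produced by a learned network rather than passed in by hand.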

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Multi-stage Cascading Ranking System
  • Figure 2: The GESR paradigm, showing the preprocessing procedures in the MoA Config, the mixture of attention mechanisms in the MoA block, and the signal amplification through MLPG. The vanilla Two Tower paradigm is outlined in red dashed lines for comparison.
  • Figure 3: (1) Target-Aware Self Attention module (2) NRO Cross Attention module (3) RO Cross Attention module
  • Figure 4: Overview of Hard Matching Attention module
  • Figure 5: (1) Traditional Two Tower serving, where item embeddings are precomputed offline and cached; (2) GESR serving, which concatenates item IDs with the user history sequence and caches additional item features for the HMA module