Differentiable Fast Top-K Selection for Large-Scale Recommendation

Yanjie Zhu, Zhen Zhang, Yunli Wang, Zhiqiang Wang, Yu Li, Rufan Zhou, Shiyang Wen, Peng Jiang, Chenhao Lin, Jian Yang

TL;DR

This work tackles the non-differentiability of the Top-K operator in cascade ranking by introducing DFTopK, a differentiable Top-K operator with linear-time complexity $O(n)$ that bypasses sorting-induced gradient conflicts. DFTopK uses a threshold-based relaxation with $\theta(x)=\frac{x_{[k]}+x_{[k+1]}}{2}$ and $f_k(x)=\sigma(x-\theta(x))$, enabling end-to-end training with a BCE loss and a temperature parameter $\tau$ to control approximation fidelity. The authors provide theoretical analysis showing localized gradients near the $k$-th and $(k+1)$-th elements and demonstrate that this approach reduces gradient conflicts while delivering competitive or superior performance and significant efficiency gains in both offline benchmarks (RecFlow) and online industrial deployments. They validate DFTopK through comprehensive offline experiments, streaming evaluations, and an online A/B test in advertising, reporting revenue and conversion improvements under equal budgets and substantial reductions in per-impression latency, thereby enabling scalable training and deployment for large-scale recommendation systems. The work also contributes an open-source implementation to accelerate future research and industrial adoption in differentiable Top-K modeling.
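The threshold relaxation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' released implementation. It makes several assumptions: the names `kth_largest` and `soft_top_k` are hypothetical, the temperature is assumed to enter as a divisor inside the sigmoid, and a randomized quickselect (expected linear time) stands in for whatever selection routine the paper uses.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def kth_largest(values, k):
    """Return the k-th largest value (1-indexed) without a full sort.

    Randomized quickselect: expected O(n) time. Hypothetical helper,
    used here only to avoid sorting the whole score vector.
    """
    items = list(values)
    target = k - 1  # position in descending order
    while True:
        pivot = random.choice(items)
        greater = [v for v in items if v > pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if target < len(greater):
            items = greater
        elif target < len(greater) + equal_count:
            return pivot
        else:
            target -= len(greater) + equal_count
            items = [v for v in items if v < pivot]


def soft_top_k(scores, k, tau=1.0):
    """Relaxed Top-K membership: sigmoid((x - theta) / tau).

    theta is the midpoint of the k-th and (k+1)-th largest scores.
    Dividing by tau is an assumption about where the temperature enters.
    """
    if not 1 <= k < len(scores):
        raise ValueError("k must satisfy 1 <= k < len(scores)")
    theta = 0.5 * (kth_largest(scores, k) + kth_largest(scores, k + 1))
    return [sigmoid((x - theta) / tau) for x in scores]


scores = [2.0, 0.5, 1.0, -1.0, 3.0]
memberships = soft_top_k(scores, k=2, tau=0.1)
print(memberships)  # values above 0.5 mark the relaxed Top-2 set
```

Because the threshold sits midway between the $k$-th and $(k+1)$-th scores, exactly $k$ entries land above 0.5 when those two scores differ. The outputs are not forced to sum to $k$, which is consistent with the relaxed normalization that the abstract credits for the closed-form solution.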

Abstract

Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, hindering end-to-end training. Existing methods include Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics like NDCG and suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing the use of soft permutation matrices. However, even state-of-the-art differentiable Top-K operators (e.g., LapSum) require $O(n \log n)$ complexity due to their dependence on sorting for solving the threshold. Thus, we propose DFTopK, a novel differentiable Top-K operator achieving optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFlow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift with the same computational budget compared to the baseline. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.

Paper Structure

This paper contains 18 sections, 20 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: A typical cascade ranking architecture with four stages: Matching, Pre-ranking, Ranking, and Mix-ranking. The red points represent the ground truth for the selection.
  • Figure 2: Gradient conflict in soft permutation matrix. In NeuralSort, the sum-to-one constraint in each row inevitably induces zero-sum competition among ground-truth items, causing gradient conflict in every row.
  • Figure 3: Sensitivity Analysis of $\tau$. This figure shows the effect of $\tau$ on our operator. It illustrates the trade-off between approximation hardness and gradient magnitude, demonstrating robustness across a reasonable range.
  • Figure 4: Performance under varying negative sampling sizes (Top-4 methods shown). DFTopK consistently achieves SOTA performance across multiple data-scaling settings, demonstrating its efficiency and robustness.
  • Figure 5: Streaming evaluation of Top-4 methods. DFTopK shows superior adaptability and long-term stability across dynamic data streams.
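
The trade-off named in the Figure 3 caption can be illustrated numerically for a generic sigmoid relaxation. The sketch below is an assumption-laden illustration and does not reproduce the paper's experiment: it assumes the temperature $\tau$ divides the margin between a score and the threshold, and the helper name `membership_and_slope` is hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def membership_and_slope(margin, tau):
    """Return the soft membership and its derivative w.r.t. the margin.

    margin is the score minus the threshold. With p = sigmoid(margin / tau),
    the derivative with respect to margin is p * (1 - p) / tau.
    """
    p = sigmoid(margin / tau)
    return p, p * (1.0 - p) / tau


for tau in (0.1, 1.0, 10.0):
    p_near, g_near = membership_and_slope(0.05, tau)  # item close to the threshold
    p_far, g_far = membership_and_slope(2.0, tau)     # item well above the threshold
    print(f"tau={tau}: near p={p_near:.4f} grad={g_near:.4f} | "
          f"far p={p_far:.4f} grad={g_far:.6f}")
```

A small $\tau$ pushes memberships toward 0 or 1 and concentrates the gradient on items near the threshold, while a large $\tau$ spreads a weaker gradient over more items at the cost of a softer approximation.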