Differentiable Fast Top-K Selection for Large-Scale Recommendation

Yanjie Zhu; Zhen Zhang; Yunli Wang; Zhiqiang Wang; Yu Li; Rufan Zhou; Shiyang Wen; Peng Jiang; Chenhao Lin; Jian Yang

TL;DR

This work tackles the non-differentiability of the Top-K operator in cascade ranking by introducing DFTopK, a differentiable Top-K operator with linear-time complexity $O(n)$ that bypasses sorting-induced gradient conflicts. DFTopK uses a threshold-based relaxation with $\theta(x)=\frac{x_{[k]}+x_{[k+1]}}{2}$ and $f_k(x)=\sigma(x-\theta(x))$, enabling end-to-end training with a BCE loss and a temperature parameter $\tau$ to control approximation fidelity. The authors provide theoretical analysis showing localized gradients near the $k$-th and $(k+1)$-th elements and demonstrate that this approach reduces gradient conflicts while delivering competitive or superior performance and significant efficiency gains in both offline benchmarks (RecFlow) and online industrial deployments. They validate DFTopK through comprehensive offline experiments, streaming evaluations, and an online A/B test in advertising, reporting revenue and conversion improvements under equal budgets and substantial reductions in per-impression latency, thereby enabling scalable training and deployment for large-scale recommendation systems. The work also contributes an open-source implementation to accelerate future research and industrial adoption in differentiable Top-K modeling.
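The threshold-based relaxation described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' released implementation: the function name `dftopk_sketch` is hypothetical, and placing $\tau$ as a divisor inside the sigmoid is an assumption, since the summary does not say where the temperature enters. The sketch uses `np.partition` (a selection routine, not a full sort) to find $x_{[k]}$ and $x_{[k+1]}$.

```python
import numpy as np


def dftopk_sketch(scores, k, tau=1.0):
    """Soft Top-K mask sigma((x - theta) / tau) for a 1-D score vector.

    theta is the midpoint of the k-th and (k+1)-th largest scores.
    Dividing by tau is an assumption made for this illustration.
    """
    x = np.asarray(scores, dtype=float)
    n = x.size
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    # Selection instead of sorting: after partitioning, ascending positions
    # n-k and n-k-1 hold the k-th and (k+1)-th largest values respectively.
    part = np.partition(x, [n - k - 1, n - k])
    kth_largest = part[n - k]
    next_largest = part[n - k - 1]
    theta = 0.5 * (kth_largest + next_largest)
    # Element-wise sigmoid around the shared threshold.
    return 1.0 / (1.0 + np.exp(-(x - theta) / tau))


# Usage: a small tau pushes the mask toward a hard 0/1 Top-K indicator.
mask = dftopk_sketch([0.1, 2.0, -1.0, 0.7, 1.5], k=2, tau=0.1)
```

Because the threshold sits strictly between the $k$-th and $(k+1)$-th scores when they differ, exactly $k$ entries of the mask exceed 0.5. In a training setup the mask would be produced by an autodiff framework rather than NumPy, so that gradients flow through both the scores and the threshold.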

Abstract

Cascade ranking is a widely adopted paradigm in large-scale information retrieval systems for Top-K item selection. However, the Top-K operator is non-differentiable, which hinders end-to-end training. Existing methods fall into two groups: Learning-to-Rank approaches (e.g., LambdaLoss), which optimize ranking metrics such as NDCG and therefore suffer from objective misalignment, and differentiable sorting-based methods (e.g., ARF, LCRON), which relax permutation matrices for direct Top-K optimization but introduce gradient conflicts through matrix aggregation. A promising alternative is to directly construct a differentiable approximation of the Top-K selection operator, bypassing soft permutation matrices. However, even state-of-the-art differentiable Top-K operators (e.g., LapSum) require $O(n \log n)$ time because they depend on sorting to solve for the threshold. We therefore propose DFTopK, a novel differentiable Top-K operator that achieves optimal $O(n)$ time complexity. By relaxing normalization constraints, DFTopK admits a closed-form solution and avoids sorting. DFTopK also avoids the gradient conflicts inherent in differentiable sorting-based methods. We evaluate DFTopK on both the public benchmark RecFlow and an industrial system. Experimental results show that DFTopK significantly improves training efficiency while achieving superior performance, which enables us to scale up training samples more efficiently. In the online A/B test, DFTopK yielded a +1.77% revenue lift over the baseline under the same computational budget. To the best of our knowledge, this work is the first to introduce differentiable Top-K operators into recommendation systems and the first to achieve theoretically optimal linear-time complexity for Top-K selection. We have open-sourced our implementation to facilitate future research in both academia and industry.
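As a hedged illustration of why a shared closed-form threshold keeps gradients local, consider the relaxation quoted in the TL;DR with the temperature written explicitly, $f_k(x)_i=\sigma(z_i)$, $z_i=(x_i-\theta(x))/\tau$, $\theta(x)=\frac{x_{[k]}+x_{[k+1]}}{2}$. Two things here are assumptions made for this sketch and are not stated in the text above: that $\tau$ enters as a divisor inside the sigmoid, and that there are no ties at the boundary. The symbol $\pi(k)$, denoting the index of the $k$-th largest score, is notation introduced only for this illustration.

$$
\frac{\partial f_k(x)_i}{\partial x_j}
= \frac{1}{\tau}\,\sigma'(z_i)\left(\delta_{ij} - \tfrac{1}{2}\,\mathbb{1}\big[j\in\{\pi(k),\pi(k+1)\}\big]\right),
\qquad \sigma'(z)=\sigma(z)\big(1-\sigma(z)\big).
$$

Under these assumptions each score $x_j$ influences its own output directly, and only the two boundary scores $x_{[k]}$ and $x_{[k+1]}$ influence the other outputs, through the threshold. Since $\sigma'(z_i)$ decays as $|z_i|$ grows, the gradient mass concentrates on items near the threshold, which is consistent with the localized-gradient behavior described in the TL;DR.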
