SaFRO: Satisfaction-Aware Fusion via Dual-Relative Policy Optimization for Short-Video Search

Renzhe Zhou, Songyang Li, Feiran Zhu, Chenglei Dai, Yi Zhang, Yi Wang, Jingwei Zhuo

Abstract

Multi-Task Fusion plays a pivotal role in industrial short-video search systems by aggregating heterogeneous prediction signals into a unified ranking score. However, existing approaches predominantly optimize for immediate engagement metrics, which often fail to align with long-term user satisfaction. While Reinforcement Learning (RL) offers a promising avenue for optimizing user satisfaction, applying it directly to search is non-trivial because search, compared with recommendation feeds, suffers from inherent data sparsity and intent constraints. To this end, we propose SaFRO, a novel framework designed to optimize user satisfaction in short-video search. We first construct a satisfaction-aware reward model that uses query-level behavioral proxies to capture holistic user satisfaction beyond item-level interactions. We then introduce Dual-Relative Policy Optimization (DRPO), an efficient policy learning method that updates the fusion policy through relative preference comparisons within groups and across batches. Furthermore, we design a Task-Relation-Aware Fusion module that explicitly models the interdependencies among different objectives, enabling context-sensitive weight adaptation. Extensive offline evaluations and large-scale online A/B tests on the Kuaishou short-video search platform demonstrate that SaFRO significantly outperforms state-of-the-art baselines, delivering substantial gains in both short-term ranking quality and long-term user retention.

Paper Structure (22 sections, 15 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Cascaded search system architecture.
  • Figure 2: Overview of the SaFRO framework. After embedding input features into a state, the fusion policy employs a relation matrix to output weight distributions. During training, sampled weights (depicted as squares) are fused with their corresponding predicted scores (depicted as circles, e.g., purple for CTR) to generate ranked lists, which are evaluated by a composite reward function, followed by policy updates using the dual-relative advantage.
  • Figure 3: Comparison of user behavioral patterns by retention status. Left: Daily query reformulation rates observed over a one-month period. Right: Distribution of session gaps (log-scale) across different quantiles.
  • Figure 4: An illustration of dual-relative advantage.
  • Figure 5: Comparison between GRPO and DRPO. Left: Advantage distribution with red and green dashed lines indicating amplification and suppression, respectively. Right: Policy gradient where arrows and annotations illustrate how Term II transforms GRPO gradients into DRPO gradients.
  • ...and 3 more figures