GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration

Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li

TL;DR

This work tackles the challenge of balancing multiple objectives in large-scale recommender systems by learning personalized fusion weights rather than relying on static, global weights. It proposes GRADE, a critic-free Group Relative Policy Optimization framework that uses Adaptive Dirichlet Exploration to efficiently search the fusion-weight space and a composite reward that combines posterior user feedback, prior model predictions, and heuristic weight-format constraints. The approach uses a two-stage training pipeline: Stage 1 with supervised Learning-to-Rank initialization via LambdaLoss, and Stage 2 with GRPO fine-tuning that updates a policy $\boldsymbol{\theta}$ based on relative advantages of groups of weight vectors, while regularizing with a KL term to a reference policy. Empirical results from offline auctions and large-scale online A/B tests show meaningful gains in CTR, CVR, OPM, and GPM, with ablations highlighting the importance of the Dirichlet exploration strategy and the composite reward in preventing reward hacking and achieving robust, personalized performance. The method is deployed in a real marketplace setting, demonstrating practical impact on hundreds of millions of users and illustrating the importance of principled exploration and user-centric objective fusion in production systems.

Abstract

Balancing multiple objectives is critical for user satisfaction in modern recommender and search systems, yet current Multi-Task Fusion (MTF) methods rely on static, manually tuned weights that fail to capture individual user intent. While Reinforcement Learning (RL) offers a path to personalization, traditional approaches often falter due to training instability and the sparse rewards inherent in these large-scale systems. To address these limitations, we propose Group-relative Reinforcement learning with Adaptive Dirichlet Exploration (GRADE), a novel and robust framework for personalized multi-task fusion. GRADE leverages a critic-free Group Relative Policy Optimization (GRPO) paradigm, enabling stable and efficient policy learning by evaluating the relative performance of candidate weight groups. Its core innovations include employing the Dirichlet distribution for principled and structured exploration of the weight space, and a composite reward function that combines sparse user feedback with dense model priors and rule-based constraints to guide the search effectively. Deployed in the in-app marketplace of an application with hundreds of millions of daily active users, GRADE significantly outperforms established baselines, achieving substantial gains in rigorous large-scale A/B tests: +0.595% in CTR, +1.193% in CVR, +1.788% in OPM, and +1.568% in total order volume. Following its strong performance, GRADE has been fully deployed in the marketplace search scenario of Kuaishou, serving hundreds of millions of users.

Paper Structure

This paper contains 22 sections, 17 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Overall architecture of the personalized multi-objective ranking system. It comprises: (1) a Feature Center and Prerank Model for initial feature processing and candidate generation; (2) a Multi-Task Learning (MTL) model predicting various user feedback signals; (3) a Multi-Task Fusion (MTF) module (our proposed GRADE framework) that learns personalized weights ($w_1, \dots, w_n$). The Blended Ranking Model applies these weights to compute final scores, sorts the items by score to produce a blended ranking, and delivers the results to users.
  • Figure 2: The architecture of the Stage 1 supervised Learning-to-Rank (LTR) model, which is pre-trained to provide a robust baseline policy. The model learns to generate fusion weights using a multi-objective pairwise loss based on the LambdaLoss framework. In the depicted LambdaLoss formula, $S_c$ and $S_a$ represent the fusion scores of two items, and $\sigma$ denotes the sigmoid function.
  • Figure 3: The training loop for Stage 2: GRPO-based fine-tuning.
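
The exploration and group-relative steps described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes a weighted-sum fusion of task scores, the standard GRPO-style advantage (rewards normalized by the group mean and standard deviation), and a placeholder reward in place of the paper's composite reward. The concentration parameters `alpha`, the group size, and all function names are hypothetical.

```python
import numpy as np

def sample_weight_group(alpha, group_size, rng):
    """Draw a group of candidate fusion-weight vectors from Dirichlet(alpha).

    Each row is non-negative and sums to 1, so every candidate lies on the simplex.
    """
    return rng.dirichlet(alpha, size=group_size)

def fuse_scores(weights, task_scores):
    """Assumed fusion form: weighted sum over tasks.

    task_scores has shape (n_items, n_tasks); the result has shape (n_items,).
    """
    return task_scores @ weights

def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: normalize each reward within its group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)

# Hypothetical concentration parameters for three tasks (e.g. click, conversion, order).
alpha = np.array([2.0, 1.0, 1.0])
group = sample_weight_group(alpha, group_size=8, rng=rng)

# Hypothetical MTL predictions for 5 candidate items and 3 tasks.
task_scores = rng.random((5, 3))

# Placeholder reward (NOT the paper's composite reward): the fused score of item 0,
# standing in for "the item the user interacted with".
rewards = np.array([fuse_scores(w, task_scores)[0] for w in group])

advantages = group_relative_advantages(rewards)
ranking = np.argsort(-fuse_scores(group[0], task_scores))  # item order under the first candidate
```

Because the advantages are computed relative to the other candidates in the same group, no learned value function (critic) is needed. Candidates with above-average reward receive positive advantages and the rest receive negative ones.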