GRADE: Personalized Multi-Task Fusion via Group-relative Reinforcement Learning with Adaptive Dirichlet Exploration
Tingfeng Hong, Pingye Ren, Xinlong Xiao, Chao Wang, Chenyi Lei, Wenwu Ou, Han Li
TL;DR
This work tackles the challenge of balancing multiple objectives in large-scale recommender systems by learning personalized fusion weights rather than relying on static, global weights. It proposes GRADE, a critic-free Group Relative Policy Optimization framework that uses Adaptive Dirichlet Exploration to efficiently search the fusion-weight space and a composite reward that combines posterior user feedback, prior model predictions, and heuristic weight-format constraints. The approach uses a two-stage training pipeline: Stage 1 with supervised Learning-to-Rank initialization via LambdaLoss, and Stage 2 with GRPO fine-tuning that updates a policy $\boldsymbol{\theta}$ based on relative advantages of groups of weight vectors, while regularizing with a KL term to a reference policy. Empirical results from offline auctions and large-scale online A/B tests show meaningful gains in CTR, CVR, OPM, and GPM, with ablations highlighting the importance of the Dirichlet exploration strategy and the composite reward in preventing reward hacking and achieving robust, personalized performance. The method is deployed in a real marketplace setting, demonstrating practical impact on hundreds of millions of users and illustrating the importance of principled exploration and user-centric objective fusion in production systems.
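The group-relative step described above can be illustrated with a minimal sketch. It assumes the common GRPO-style standardization (reward minus group mean, divided by group standard deviation) and a plain NumPy Dirichlet sampler. The concentration parameters `alpha`, the toy reward, and both function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def sample_weight_group(alpha, group_size, rng):
    """Draw `group_size` candidate fusion-weight vectors from Dirichlet(alpha).

    Each row is non-negative and sums to 1, so it is a valid set of fusion weights.
    """
    return rng.dirichlet(alpha, size=group_size)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group, so no learned critic is needed."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)

# Hypothetical concentration parameters, one per objective (e.g. click, convert, order, GMV).
alpha = np.array([2.0, 1.0, 1.0, 0.5])

# One group of candidate weight vectors for a single request.
weights = sample_weight_group(alpha, group_size=8, rng=rng)

# Toy stand-in for the composite reward: a fixed, made-up per-objective payoff.
toy_payoff = np.array([0.3, 0.5, 0.9, 0.7])
rewards = weights @ toy_payoff

# Candidates that beat the group average get a positive advantage.
advantages = group_relative_advantages(rewards)
```

In a full training loop these advantages would weight the policy-gradient update of the network that outputs `alpha`, together with the KL penalty toward the reference policy. That part is omitted here.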
Abstract
Balancing multiple objectives is critical for user satisfaction in modern recommender and search systems, yet current Multi-Task Fusion (MTF) methods rely on static, manually tuned weights that fail to capture individual user intent. While Reinforcement Learning (RL) offers a path to personalization, traditional approaches often falter due to training instability and the sparse rewards inherent in these large-scale systems. To address these limitations, we propose Group-relative Reinforcement learning with Adaptive Dirichlet Exploration (GRADE), a novel and robust framework for personalized multi-task fusion. GRADE leverages a critic-free Group Relative Policy Optimization (GRPO) paradigm, enabling stable and efficient policy learning by evaluating the relative performance of candidate weight groups. Its core innovations include employing the Dirichlet distribution for principled and structured exploration of the weight space, and a composite reward function that combines sparse user feedback with dense model priors and rule-based constraints to guide the search effectively. Deployed in the in-app marketplace of an application with hundreds of millions of daily active users, GRADE significantly outperforms established baselines, achieving substantial gains in rigorous large-scale A/B tests: +0.595% in CTR, +1.193% in CVR, +1.788% in OPM, and +1.568% in total order volume. Following its strong performance, GRADE has been fully deployed in the marketplace search scenario of Kuaishou, serving hundreds of millions of users.
