Direct Routing Gradient (DRGrad): A Personalized Information Surgery for Multi-Task Learning (MTL) Recommendations

Yuguang Liu, Yiyun Miao, Luyao Xia

TL;DR

DRGrad tackles gradient conflicts in industrial-scale multi-task learning for recommender systems by routing gradient information through a Router Network, aggregating it with an Updater Network, and injecting personalization via a PPNet-based gate. The approach combines a Split-MMoE-like structure to protect the primary task signal with a personalization module to tailor gradients to individual users. Empirical results on a real-world 15B-sample dataset, plus Census-Income and synthetic data, show consistent AUC gains for primary and auxiliary tasks and notable online improvements, with minimal latency overhead. Together, these contributions deliver a scalable, end-to-end gradient-routing mechanism that enhances MTL performance in production-grade recommender systems.

Abstract

Multi-task learning (MTL) has emerged as a successful strategy in industrial-scale recommender systems, offering significant advantages such as capturing diverse users' interests and accurately detecting different behaviors like "click" or "dwell time". However, negative transfer and the seesaw phenomenon pose challenges to MTL models due to the complex and often contradictory task correlations in real-world recommendations. To address the problem while making better use of personalized information, we propose a personalized Direct Routing Gradient framework (DRGrad), which consists of three key components: a router, an updater, and a personalized gate network. DRGrad judges the stakes between tasks during training, so that each task can leverage all gradients that are valid for it, which reduces conflicts. We evaluate the efficiency of DRGrad on complex MTL using a real-world recommendation dataset with 15 billion samples. The results show DRGrad's superior performance over competing state-of-the-art MTL models, especially in terms of AUC (Area Under the Curve) metrics, indicating that it effectively manages task conflicts in multi-task learning environments without increasing model complexity, while also addressing the deficiencies in noise processing. Moreover, experiments on the public Census-Income dataset and a synthetic dataset demonstrate the capability of DRGrad in judging and routing the stakes between tasks with varying degrees of correlation and personalization.

Paper Structure

This paper contains 19 sections, 8 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: (a) defines $\theta$ as the angle between gradients. When $\theta > 90^\circ$, the gradients update in opposing directions, resulting in conflicts. When $\theta < 90^\circ$, the gradients cooperate with each other. (b) separates $task_1$ into two parts: one uses a dedicated layer and the other shares a layer with $task_2$.
  • Figure 2: DRGrad model structure. The DNN tower ${T_1}^{'}$ takes the dedicated tensor $v_1$ as its input, and ${T_1}^{"}$ shares the same input tensor, named $v_s$, with ${T_2}$. The $Task_1$ output is aggregated from the outputs of ${T_1}^{'}$ and ${T_1}^{"}$, denoted ${T_1}^{'}(v_1)$ and ${T_1}^{"}(v_s)$. The tensor $v_{PPNet}$ is the input of PPNet and contains the personalized embedding of users. $G$ is the Gate Network, which uses a softmax function, and $G_p$ is the Gate Network for PPNet, which uses a sigmoid function.
  • Figure 3: (a) is the Router Network. The gradients ${g_1}^{'}$, ${g_1}^{"}$ and ${g_2}$, which come from $task_1$ and $task_2$, are the inputs of the Router Network. The processed gradients $g_{R,1}^{'}$ and $g_{R,1}^{"}$ are the outputs, used to update the parameters of ${T_1}^{'}$ and ${T_1}^{"}$. (b) is the Updater Network. The gradients ${g_1}^{'}$, ${g_1}^{"}$, $g_{R,1}^{'}$ and $g_{R,1}^{"}$ are the inputs of the Updater Network, and the outputs $\mu_1^{'}$ and $\mu_1^{"}$ are used to aggregate $task_1$ dynamically. (c) shows personalized gradients: $g_1^E$ represents the gradient expectation over all users, and $g_1^{U_1}$ represents the gradient for user $U_1$.
  • Figure 4: Gradient direction with respect to click for the structure in Fig. 1(b). The gradient direction between tasks fluctuates sharply between positive and negative.
  • Figure 6: Gradient direction with respect to click in the DRGrad model. The gradients become aligned in direction, which makes convergence easier.
  • ...and 1 more figure
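
The conflict criterion in the Figure 1(a) caption can be illustrated with a small sketch. It is not the paper's Router Network; it only computes the angle $\theta$ between two gradient vectors and flags a conflict when $\theta > 90^\circ$, which is equivalent to a negative dot product. The function names and the toy vectors are assumptions made for illustration.

```python
import math


def gradient_angle_deg(g1, g2):
    """Angle theta (in degrees) between two gradient vectors of equal length."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    if norm1 == 0.0 or norm2 == 0.0:
        raise ValueError("the angle is undefined for a zero gradient")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm1 * norm2)))
    return math.degrees(math.acos(cos_theta))


def gradients_conflict(g1, g2):
    """True when theta > 90 degrees, i.e. when the dot product is negative."""
    return sum(a * b for a, b in zip(g1, g2)) < 0.0


# Hypothetical toy gradients for two tasks on a shared 2-parameter layer.
g_task1 = [1.0, 0.0]
g_task2_cooperating = [1.0, 1.0]    # theta < 90 degrees
g_task2_conflicting = [-1.0, 1.0]   # theta > 90 degrees

for name, g2 in [("cooperating", g_task2_cooperating),
                 ("conflicting", g_task2_conflicting)]:
    theta = gradient_angle_deg(g_task1, g2)
    print(f"{name}: theta = {theta:.1f} deg, conflict = {gradients_conflict(g_task1, g2)}")
```

Checking the sign of the dot product avoids computing the angle at all, which is the cheaper test when only the conflict decision is needed. Orthogonal gradients ($\theta = 90^\circ$) fall into neither case of the caption and are treated here as non-conflicting.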