HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou

Xu Wang; Jiangxia Cao; Zhiyi Fu; Kun Gai; Guorui Zhou

TL;DR

HoME addresses practical instabilities in industrial multi-task MoE models by introducing Expert Normalization, a Hierarchy Mask, and Feature-gate and Self-gate mechanisms to balance and make use of a large set of shared and task-specific experts. The approach yields stable training, improved offline metrics, and meaningful online gains across Kuaishou's short-video services, which serve hundreds of millions of users. The work shows that careful architectural design and gating strategies can significantly improve multi-task MoE performance in real-world recommender systems. The practical result is a more reliable and efficient multi-task MoE framework that handles both dense and sparse tasks without expert collapse or degradation of specialized experts.

Abstract

In this paper, we present the practical problems encountered and the lessons learned in Kuaishou's short-video services. In industry, a widely used multi-task framework is the Mixture-of-Experts (MoE) paradigm, which introduces shared and task-specific experts for each task and then uses gate networks to weigh the contributions of the related experts. Although MoE achieves remarkable improvements, we still observed three anomalies that seriously affected model performance during our iterations: (1) Expert Collapse: We found that experts' output distributions differ significantly, and some experts have over 90% zero activations with ReLU, making it hard for gate networks to assign fair weights that balance the experts. (2) Expert Degradation: Ideally, a shared expert provides predictive information for all tasks simultaneously. Nevertheless, we find that some shared experts are occupied by only one task, which indicates that they have lost their shared role and degenerated into specific experts. (3) Expert Underfitting: In our services, we have dozens of behavior tasks to predict, but we find that some data-sparse prediction tasks tend to ignore their specific experts and assign large weights to shared experts. The reason might be that shared experts receive more gradient updates and knowledge from dense tasks, while specific experts easily underfit because their behaviors are sparse. Motivated by these observations, we propose HoME to achieve a simple, efficient, and balanced MoE system for multi-task learning.
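
The expert-collapse symptom described above (over 90% zero activations with ReLU) can be illustrated with a small diagnostic. The following is a minimal sketch, not the authors' implementation: the synthetic pre-activations, the helper names (`zero_fraction`, `normalize_columns`), and the use of plain per-dimension standardization followed by Swish are all assumptions made for illustration, loosely motivated by the "Expert Normalization&Swish Mechanism" section title.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def swish(x):
    # Swish (SiLU): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def zero_fraction(batch):
    """Share of exactly-zero entries in a batch of expert outputs."""
    values = [v for row in batch for v in row]
    return sum(1 for v in values if v == 0.0) / len(values)


def normalize_columns(batch, eps=1e-5):
    """Standardize each output dimension over the batch.

    Batch-norm style, without learned scale or shift (an assumption).
    """
    n, dims = len(batch), len(batch[0])
    out = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / math.sqrt(var + eps)
    return out


rng = random.Random(0)
# Hypothetical pre-activations of one expert (256 samples, 8 dimensions),
# shifted negative so that ReLU zeroes out most of them.
pre_act = [[rng.gauss(-2.0, 1.0) for _ in range(8)] for _ in range(256)]

relu_out = [[relu(v) for v in row] for row in pre_act]
swish_out = [[swish(v) for v in row] for row in normalize_columns(pre_act)]

relu_zero = zero_fraction(relu_out)    # fraction of dead ReLU outputs
swish_zero = zero_fraction(swish_out)  # same measure after norm + Swish
print(f"ReLU zero fraction:  {relu_zero:.3f}")
print(f"Swish zero fraction: {swish_zero:.3f}")
```

In this toy setting the ReLU expert emits mostly zeros, while the standardized, Swish-activated output keeps a non-zero signal in nearly every position, which is the kind of comparable output scale a gate network needs to weigh experts fairly.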

Paper Structure (17 sections, 12 equations, 6 figures, 4 tables)

Introduction
Related Works
Methodology
Preliminary: Multi-Task Learning for Industrial Recommender System
Label&Feature
Mixture-of-Experts for XTR prediction
Expert Normalization&Swish Mechanism
Hierarchy Mask Mechanism
Feature-gate&Self-gate mechanisms
Experiments
Experiments Setup
Offline Experiments
Discussion of Hyper-Parameter Sensitivity
Discussion of HoME Situation
Online A/B Test
...and 2 more sections

Figures (6)

Figure 1: Typical multi-task behaviors at Kuaishou.
Figure 2: Illustration of a naive MMoE and the expert collapse issue occurring in practice. As shown in (b), expert6 is always assigned the largest gate value, over 0.98 in most cases, by all tasks. We also noticed that expert6 outputs much smaller and sparser activation values than other experts, as shown in (c). These phenomena indicate that in the real data-streaming scenario, MMoE is unstable and prone to collapse, which hinders fair comparisons among experts and hurts model performance.
Figure 3: Expert degradation issue in CGC, where the two shared experts are almost monopolized by task2 and task7, respectively, and thus behave like task-specific experts.
Figure 4: Expert underfitting issue, where task1 and task6 rely almost entirely on shared experts and ignore their own specific experts, leaving the specific expert networks underused.
Figure 5: The HoME and other MoE-style multi-task learning architectures. In HoME, tasks are divided into groups based on their relatedness and modeled as fully-shared or partial-shared meta-representations in the first layer, then refined as specific task representations in the second layer. HoME further introduces two specially designed modules: Feature-gate to alleviate task conflicts at the input level, and Self-gate to ensure that each task makes the most of specific experts. Best viewed in color.
...and 1 more figure
