DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning

Hussein Hazimeh, Zhe Zhao, Aakanksha Chowdhery, Maheswaran Sathiamoorthy, Yihua Chen, Rahul Mazumder, Lichan Hong, Ed H. Chi

TL;DR

DSelect-k tackles the lack of differentiability in sparse Mixture-of-Experts gates by introducing a binary-encoding reformulation that enforces a hard sparsity constraint while remaining SGD-trainable through a smooth relaxation. It provides static and per-example gating variants, along with an equivalence proof linking the unconstrained reformulation to the original problem and an entropy-based mechanism to encourage binary convergence. Empirically, DSelect-k improves predictive performance and expert selection on synthetic and real multi-task datasets, including a large-scale recommender system, while using far fewer parameters than dense gates. The work offers a practical, open-source approach to efficient, interpretable parameter sharing in MoE for multi-task learning and scalable models.
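To make the non-smoothness concrete, here is a minimal sketch of a plain Top-k gate, assuming the common form that keeps the k largest logits, applies a softmax over them, and zeroes the rest. The function name `top_k_gate` and the example logits are illustrative and are not taken from the paper.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; all other experts get weight 0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    shift = max(logits[i] for i in keep)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - shift) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

# Perturb the third logit by a tiny amount around the second logit.
eps = 1e-6
before = top_k_gate([1.0, 0.5, 0.5 - eps], k=2)  # experts 0 and 1 are selected
after = top_k_gate([1.0, 0.5, 0.5 + eps], k=2)   # experts 0 and 2 are selected
```

An arbitrarily small change in one logit swaps which expert is selected, so the weight of expert 1 drops from a clearly positive value to exactly zero. This jump is the kind of discontinuity that a smooth gate is meant to avoid.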

Abstract

The Mixture-of-Experts (MoE) architecture is showing promising results in improving parameter sharing in multi-task learning (MTL) and in scaling high-capacity neural networks. State-of-the-art MoE models use a trainable sparse gate to select a subset of the experts for each input example. While conceptually appealing, existing sparse gates, such as Top-k, are not smooth. The lack of smoothness can lead to convergence and statistical performance issues when training with gradient-based methods. In this paper, we develop DSelect-k: a continuously differentiable and sparse gate for MoE, based on a novel binary encoding formulation. The gate can be trained using first-order methods, such as stochastic gradient descent, and offers explicit control over the number of experts to select. We demonstrate the effectiveness of DSelect-k on both synthetic and real MTL datasets with up to $128$ tasks. Our experiments indicate that DSelect-k can achieve statistically significant improvements in prediction and expert selection over popular MoE gates. Notably, on a real-world, large-scale recommender system, DSelect-k achieves over $22\%$ improvement in predictive performance compared to Top-k. We provide an open-source implementation of DSelect-k.
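The binary-encoding idea can be illustrated with a small sketch of a static gate. It assumes n = 2^m experts, a selector that assigns expert i the product of z_j or (1 - z_j) according to the bits of i, and a softmax-weighted sum of k such selectors. The function names, the least-significant-bit-first ordering, and the example values are assumptions made for illustration, not the authors' implementation.

```python
import math

def single_expert_selector(z):
    """Weights over 2**m experts from m relaxed binary codes z_j in [0, 1].

    Expert i gets the product over j of z[j] if bit j of i is 1, else 1 - z[j].
    The bit ordering (least significant bit first) is an assumption.
    """
    m = len(z)
    weights = []
    for i in range(2 ** m):
        w = 1.0
        for j in range(m):
            w *= z[j] if (i >> j) & 1 else (1.0 - z[j])
        weights.append(w)
    return weights

def softmax(alpha):
    shift = max(alpha)  # subtract the max for numerical stability
    exps = [math.exp(a - shift) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

def static_gate(alpha, Z):
    """Softmax-weighted sum of k single-expert selectors (one row of Z each)."""
    mix = softmax(alpha)
    gate = [0.0] * (2 ** len(Z[0]))
    for weight, z in zip(mix, Z):
        for i, r in enumerate(single_expert_selector(z)):
            gate[i] += weight * r
    return gate

# k = 2 selectors over n = 8 experts, with fully binary codes.
binary_gate = static_gate([0.0, 0.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
# The same gate with relaxed (non-binary) codes, as it might look early in training.
relaxed_gate = static_gate([0.0, 0.0], [[0.9, 0.1, 0.2], [0.3, 0.8, 0.7]])
```

When every code is exactly 0 or 1, each selector puts all of its weight on a single expert, so the gate has at most k nonzero entries. With relaxed codes the gate is dense, but it varies smoothly with `alpha` and `Z` and still sums to one.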

Paper Structure

This paper contains 33 sections, 2 theorems, 16 equations, 9 figures, 4 tables.

Key Result

Proposition 1

Problem (eq:constrained_example) is equivalent (meaning that the two problems have the same optimal objective and that, given an optimal solution for one problem, we can construct an optimal solution for the other) to:

Figures (9)

  • Figure 1: (Left): An example of a MoE that can be used as a standalone learner or layer in a neural network. Here "Ei" denotes the $i$-th expert. (Right): A multi-gate MoE for learning two tasks simultaneously. "Task i NN" is a neural network that generates the output of Task i.
  • Figure 2: Expert weights output by Top-k (left) and DSelect-k (right) during training on synthetic data generated from a MoE, under static gating. Each color represents a separate expert. Here DSelect-k recovers the true experts used by the data-generating model, whereas Top-k does not recover them and exhibits oscillatory behavior. See Appendix \ref{sec:visualization_appendix} for details on the data and setup.
  • Figure 3: Average performance (AUC and RMSE) and standard error on a real-world recommender system with 8 tasks: "E." and "S." denote engagement and satisfaction tasks, respectively.
  • Figure 4: Expert weights of the DSelect-k gates on the recommender system.
  • Figure B.5: The Smooth-step ($\gamma = 1$) and Logistic functions.
  • ...and 4 more figures
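The two functions named in the Figure B.5 caption can be compared with a short sketch. It assumes the smooth-step is the cubic polynomial that is 0 for $t \le -\gamma/2$, 1 for $t \ge \gamma/2$, and $-\frac{2}{\gamma^3} t^3 + \frac{3}{2\gamma} t + \frac{1}{2}$ in between. The caption only names the function, so the exact polynomial used here is an assumption.

```python
import math

def smooth_step(t, gamma=1.0):
    """Assumed cubic smooth-step: exactly 0 below -gamma/2 and exactly 1 above gamma/2."""
    if t <= -gamma / 2:
        return 0.0
    if t >= gamma / 2:
        return 1.0
    return -2.0 / gamma ** 3 * t ** 3 + 3.0 / (2.0 * gamma) * t + 0.5

def logistic(t):
    """Standard logistic function; approaches 0 and 1 without reaching them."""
    return 1.0 / (1.0 + math.exp(-t))

# Evaluate both functions on a few points around the transition region.
points = [-1.0, -0.5, 0.0, 0.5, 1.0]
smooth_values = [smooth_step(t) for t in points]
logistic_values = [logistic(t) for t in points]
```

Under this assumption the smooth-step saturates at exactly 0 and 1 outside $[-\gamma/2, \gamma/2]$, which is what allows a relaxed binary code to become exactly binary, whereas the logistic function stays strictly between 0 and 1 for every finite input.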

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2