Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

Ganesh Jawahar; Haichuan Yang; Yunyang Xiong; Zechun Liu; Dilin Wang; Fei Sun; Meng Li; Aasish Pappu; Barlas Oguz; Muhammad Abdul-Mageed; Laks V. S. Lakshmanan; Raghuraman Krishnamoorthi; Vikas Chandra

TL;DR

Mixture-of-Supernets (MoS) applies architecture-routed mixture-of-experts (MoE) to weight-sharing supernets for neural architecture search (NAS) in NLP. Instead of obtaining each subnetwork's weights by trimming a single shared set of weights, MoS generates them conditioned on the candidate architecture, in layer-wise and neuron-wise variants. This makes the supernet more expressive and reduces the retraining needed after search, at modest training overhead. In experiments on efficient task-agnostic BERT and machine translation (MT), MoS attains state-of-the-art results, including a better latency-BLEU tradeoff than HAT for MT.

Abstract

Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks. Although they can generate diverse subnetworks without retraining, the quality of these subnetworks is not guaranteed because the weights are shared. In NLP tasks such as machine translation and pre-trained language modeling, there is a significant performance gap between a subnetwork extracted from the supernet and the same architecture trained from scratch, so the optimal architecture must be retrained once it has been identified. This study introduces mixture-of-supernets, a generalized supernet formulation that leverages mixture-of-experts (MoE) to enhance the expressiveness of the supernet with minimal training overhead. Unlike conventional supernets, this method employs an architecture-based routing mechanism, so that model weights are shared among subnetworks only indirectly. The resulting architecture-specific weights, learned through gradient descent, minimize retraining time and significantly improve training efficiency in NLP. The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models, exhibiting a superior latency-BLEU tradeoff compared to HAT, the SoTA NAS framework for machine translation. Furthermore, it excels in NAS for building memory-efficient task-agnostic BERT models, surpassing NAS-BERT and AutoDistil across various model sizes. The code can be found at: https://github.com/UBC-NLP/MoS.
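
To make the architecture-based routing idea concrete, the following is a minimal NumPy sketch of a layer-wise, architecture-routed linear layer. It is an illustration under stated assumptions, not the authors' implementation: the class name `LayerWiseMoSLinear`, the single-matrix router, the numeric architecture encoding `arch`, and the slicing convention are all hypothetical choices made for this example. Each expert holds a full-size weight matrix, the router maps the architecture encoding to softmax scores over the experts, and the subnetwork's weights are the score-weighted mixture of the experts, sliced to the subnetwork's input and output sizes.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


class LayerWiseMoSLinear:
    """Hypothetical sketch of a linear layer with architecture-routed experts."""

    def __init__(self, max_in, max_out, num_experts, arch_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One full-size weight matrix per expert: (num_experts, max_out, max_in).
        self.experts = rng.normal(0.0, 0.02, size=(num_experts, max_out, max_in))
        # Assumption: the router is a single linear map from the architecture
        # encoding to one logit per expert.
        self.router = rng.normal(0.0, 0.02, size=(arch_dim, num_experts))

    def scores(self, arch):
        """Alignment scores over experts for one architecture encoding."""
        return softmax(arch @ self.router)

    def weight(self, arch, n_in, n_out):
        """Architecture-specific weights: mix the experts, then slice."""
        alpha = self.scores(arch)                          # (num_experts,)
        mixed = np.tensordot(alpha, self.experts, axes=1)  # (max_out, max_in)
        return mixed[:n_out, :n_in]

    def forward(self, x, arch, n_out):
        """Apply the sliced, mixed weights to inputs of shape (batch, n_in)."""
        n_in = x.shape[-1]
        return x @ self.weight(arch, n_in, n_out).T


# Usage: a subnetwork with 5 input and 4 output features.
layer = LayerWiseMoSLinear(max_in=8, max_out=6, num_experts=3, arch_dim=4)
arch = np.array([1.0, 0.0, 0.5, 0.25])  # hypothetical architecture encoding
y = layer.forward(np.ones((2, 5)), arch, n_out=4)
```

With a single expert the scores collapse to 1 and the layer reduces to slicing one shared weight matrix, which is the conventional weight-sharing supernet in this sketch. A neuron-wise variant would, under the same assumptions, have the router emit a separate score vector per output neuron rather than one per layer.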

Paper Structure (46 sections, 4 equations, 6 figures, 18 tables)

Introduction
Supernet - Fundamentals
Mixture-of-Supernets
Generalized Model Function
Layer-wise MoS
Neuron-wise MoS
Adding $g(x,a;E)$ to Transformer
Experiments - Efficient BERT
Experiment Setup
Supernet vs. standalone gap
Comparison with SoTA NAS
Experiments - Efficient MT
Experiment setup
Supernet vs. standalone gap
Comparison with the SoTA NAS
...and 31 more sections

Figures (6)

Figure 1: Choices of linear layers for supernet training. The length and the height of the 'Linear' blocks correspond to the number of input and output features of the supernet respectively. The highlighted portions in blue color correspond to the architecture-specific weights extracted from the supernet. Different intensities of blue color in the 'Linear' blocks of the mixture-of-supernet correspond to different alignment scores generated by the router.
Figure 2: Learning Curve - Training steps vs. Validation MLM loss. 'Big' and 'Small' correspond to the largest and the smallest BERT architecture respectively from the search space of SuperShaper. 'Standalone' and 'Supernet' correspond to training from scratch and sampling from the supernet respectively. All the supernets are trained with sandwich training.
Figure 3: Supernet vs. Standalone model performance for 15 random architectures from MT search space. Supernet performance is obtained by evaluating the architecture-specific weights extracted from the supernet. Standalone model performance is obtained by training the architecture from scratch to convergence and evaluating it.
Figure 4: Additional training steps to close the supernet - standalone gap vs. performance for different latency constraints on the WMT'14 En-De dataset.
Figure 5: Additional training steps to close the supernet - standalone gap vs. performance for different latency constraints on the WMT'14 En-Fr dataset.
...and 1 more figure