Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan

TL;DR

MoM addresses the rigidity of depth-ordered Transformers by introducing dynamic module assemblies controlled by per-token routers. It defines a modular pool with $m^{\text{A}}$, $m^{\text{F}}$, and $m^{\text{S}}$ modules and uses routers to assemble these into a computation graph across $H$ rounds with $K$ selected modules per step, decoupling depth from parameter count. Empirically, MoM variants outperform vanilla GPT-2 on GLUE and XSUM across scales, while revealing that multi-head attention is more over-parameterized than FFN modules and enabling substantial efficiency gains through deeper or more modular computation. The framework unifies several dynamic computation approaches (layer-skip, MoE) as special cases and provides practical guidance for design choices and future architecture exploration. Limitations include the challenge of optimizing multi-step router decisions, suggesting future directions like reinforcement learning or neural architecture search, and the authors offer open-source code to facilitate adoption.

Abstract

Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules. We show that MoM provides not only a unified framework for Transformers and their numerous variants but also a flexible and learnable approach for reducing redundancy in Transformer parameterization. We pre-train various MoMs using OpenWebText. Empirical results demonstrate that MoMs, of different parameter counts, consistently outperform vanilla transformers on both GLUE and XSUM benchmarks. More interestingly, with a fixed parameter budget, MoM-large enables an over 38% increase in depth for computation graphs compared to GPT-2-large, resulting in absolute gains of 1.4 on GLUE and 1 on XSUM. On the other hand, MoM-large also enables an over 60% reduction in depth while involving more modules per layer, yielding a 16% reduction in TFLOPs and a 43% decrease in memory usage compared to GPT-2-large, while maintaining comparable performance.
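
The assembly procedure described above (two routers repeatedly picking $K$ modules from a shared pool over $H$ rounds) can be illustrated with a small toy sketch. This is not the authors' implementation. The names `make_module`, `skip_module`, `route`, `assemble`, `mom_forward`, and `build_toy_model` are hypothetical, the modules are random per-token maps standing in for real attention and feed-forward blocks, the router is a stateless linear scorer rather than a learned recurrent one, and SKIP is modeled as contributing nothing so that the residual connection passes the state through.

```python
import math
import random


def make_module(dim, rng):
    """Toy stand-in for an attention or FFN module: random linear map + tanh."""
    w = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

    def module(x):
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

    return module


def skip_module(x):
    # Assumption: SKIP adds nothing, so the residual carries x through unchanged.
    return [0.0] * len(x)


def route(x, router_weights, k):
    """Score each module for token state x; return top-k indices and softmax weights."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(scores[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def assemble(x, modules, router_weights, k):
    """One assembled module: residual plus weighted sum of the k selected modules."""
    top, weights = route(x, router_weights, k)
    out = list(x)
    for idx, wgt in zip(top, weights):
        y = modules[idx](x)
        out = [o + wgt * yi for o, yi in zip(out, y)]
    return out, top


def mom_forward(x, attn_pool, ffn_pool, attn_router, ffn_router, k, h):
    """Run h assembly rounds; each round applies an attention-side then an FFN-side assembly."""
    trace = []
    for _ in range(h):
        x, a_sel = assemble(x, attn_pool, attn_router, k)
        x, f_sel = assemble(x, ffn_pool, ffn_router, k)
        trace.append((a_sel, f_sel))
    return x, trace


def build_toy_model(dim=4, n_attn=3, n_ffn=3, seed=0):
    """Build two module pools (each with a SKIP entry) and two random routers."""
    rng = random.Random(seed)
    attn_pool = [make_module(dim, rng) for _ in range(n_attn)] + [skip_module]
    ffn_pool = [make_module(dim, rng) for _ in range(n_ffn)] + [skip_module]
    attn_router = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in attn_pool]
    ffn_router = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in ffn_pool]
    return attn_pool, ffn_pool, attn_router, ffn_router


if __name__ == "__main__":
    pools_and_routers = build_toy_model()
    token = [0.1, -0.2, 0.3, 0.4]
    # h=5 rounds over a pool of only 3 non-SKIP modules per type:
    # depth is set by h, not by the number of parameterized modules.
    state, trace = mom_forward(token, *pools_and_routers, k=2, h=5)
    for round_idx, (a_sel, f_sel) in enumerate(trace, start=1):
        print(f"round {round_idx}: attention modules {a_sel}, FFN modules {f_sel}")
```

In this sketch the number of rounds `h` can exceed the pool size because modules may be reused across rounds, which is the sense in which depth is decoupled from parameter count. The last index in each pool is the SKIP entry, so a selection containing it corresponds to a token partly bypassing that step.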

Paper Structure (29 sections, 11 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Mixture-of-Modules reinvents Transformers as dynamic assemblies of modules. In (b), we illustrate the ongoing construction of an MoM model during the forward computation. The assembly lasts $H$ rounds, with the current illustration showcasing progress in the third round. For each token, routers select the best $K$ attention modules, denoted as $m^{\text{A}}_{k}$, and the best $K$ feed-forward network modules, denoted as $m^{\text{F}}_{k}$, from a module set $\mathcal{M}$ (including "SKIP" modules). These selected modules collectively constitute assembled modules $\mathcal{F}^{\text{A}}$ and $\mathcal{F}^{\text{F}}$, which are then appended to the existing computation graph. Detailed notation is presented in the method section.
  • Figure 2: Visualization of the forward computation of five models, each shown with only two layers for demonstration purposes. The switch icon symbolizes the selective execution of one (in Layer-skip) or more (in MoE and MoM) subsequent computation pathways.
  • Figure 3: How validation loss varies with respect to $N_{\text{A}}$ and $N_{\text{F}}$, compared with MoM (medium) with $N_{\text{A}}=N_{\text{F}}=4$.
  • Figure 4: Validation loss for MoM-small and MoM-medium under different settings of $K$ and $H$.
  • Figure 5: Training curves of MoM-small (K1H4) with {GRU, MLP} routers.
  • ...and 1 more figure