Mixture of Hidden-Dimensions Transformer

Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

TL;DR

This work introduces Mixture of Hidden Dimensions (MoHD), a sparse activation framework for Transformers that exploits hidden-dimension sparsity by routing activations to shared and specialized sub-dimensions. MoHD preserves activation flow through activation scaling and group fusion while adding negligible compute and parameter costs. Across 10 NLP tasks, MoHD improves both parameter efficiency and task performance: it achieves $1.7\%$ higher performance with $50\%$ fewer activation parameters, and $3.7\%$ higher performance with a $3\times$ parameter expansion at constant activation cost. These results point to sparsity as a scalable path to larger hidden dimensions.

Abstract

Transformer models are difficult to scale efficiently along the hidden dimension: uniformly increasing it inflates computational and memory costs while failing to emphasize the features most relevant to each token. To understand this better, we study hidden-dimension sparsity and observe that trained Transformers utilize only a small fraction of token dimensions, revealing an "activation flow" pattern. Notably, there are shared sub-dimensions with sustained activation across multiple consecutive tokens and specialized sub-dimensions uniquely activated for each token. To better model token-relevant sub-dimensions, we propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. Specifically, MoHD employs shared sub-dimensions for common token features and a routing mechanism to dynamically activate specialized sub-dimensions. To mitigate potential information loss from sparsity, we design activation scaling and group fusion mechanisms to preserve activation flow. In this way, MoHD expands hidden dimensions with negligible increases in computation or parameters, enabling efficient training and inference while maintaining performance. Evaluations across 10 NLP tasks show that MoHD surpasses vanilla Transformers in parameter efficiency and task performance. It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3x parameter expansion at constant activation cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden-dimension sparsity to boost efficiency.
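
The routing idea in the abstract can be illustrated with a minimal sketch for a single token. This is not the authors' implementation. The equal-size contiguous groups, the linear router, the top-k selection, and the scaling factor `num_groups / len(active)` are all assumptions made for illustration, and every function and parameter name below is hypothetical. Group fusion is left out because the abstract does not specify how it works.

```python
def split_groups(x, num_groups):
    """Split a hidden vector into equal-size contiguous sub-dimension groups."""
    if len(x) % num_groups != 0:
        raise ValueError("hidden size must be divisible by num_groups")
    size = len(x) // num_groups
    return [x[i * size:(i + 1) * size] for i in range(num_groups)]


def route_token(x, router_weights, num_groups, num_shared, top_k):
    """Keep the shared groups plus the top_k routed specialized groups.

    x              -- hidden vector of one token (list of floats)
    router_weights -- one weight row of length len(x) per specialized group
                      (hypothetical linear router, assumed for this sketch)
    Returns (active_group_ids, sparse_vector).
    """
    num_specialized = num_groups - num_shared
    if len(router_weights) != num_specialized:
        raise ValueError("need one router row per specialized group")
    if not 0 <= top_k <= num_specialized:
        raise ValueError("top_k must be between 0 and the specialized count")

    groups = split_groups(x, num_groups)

    # Shared sub-dimensions are always active.
    shared = list(range(num_shared))

    # Score every specialized group with a dot product against the token.
    logits = [sum(w * v for w, v in zip(row, x)) for row in router_weights]
    ranked = sorted(range(num_specialized), key=lambda i: logits[i], reverse=True)
    chosen = sorted(num_shared + i for i in ranked[:top_k])
    active = shared + chosen

    # Assumed activation scaling: rescale the kept groups by the inverse of
    # the active fraction to compensate for the zeroed-out groups.
    scale = num_groups / len(active) if active else 0.0

    out = []
    for g, values in enumerate(groups):
        if g in active:
            out.extend(v * scale for v in values)
        else:
            out.extend(0.0 for _ in values)
    return active, out


if __name__ == "__main__":
    token = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    rows = [
        [0.0] * 8,                                   # specialized group 1
        [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0],    # specialized group 2
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, -1.0],  # specialized group 3
    ]
    print(route_token(token, rows, num_groups=4, num_shared=1, top_k=1))
```

With one shared group and `top_k=1`, only half of the hidden dimensions stay active for this token, which mirrors the "fewer activation parameters" setting in spirit. A learned router and the paper's actual scaling and fusion rules would replace the fixed choices made here.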

Paper Structure

This paper contains 19 sections, 4 theorems, 1 equation, 1 figure, 1 table, 1 algorithm.

Key Result

Proposition 2.2

If $f$ is an injective map from a set $X$ to a set $Y$, then the cardinality of $Y$ is at least as large as that of $X$.

Figures (1)

  • Figure 1: Historical locations and number of accepted papers for International Machine Learning Conferences (ICML 1993 -- ICML 2008) and International Workshops on Machine Learning (ML 1988 -- ML 1992). At the time this figure was produced, the number of accepted papers for ICML 2008 was unknown and instead estimated.

Theorems & Definitions (7)

  • Definition 2.1
  • Proposition 2.2
  • Proof
  • Lemma 2.3
  • Theorem 2.4
  • Corollary 2.5
  • Remark 2.7