Wonderful Matrices: Combining for a More Efficient and Effective Foundation Model Architecture

Jingze Shi, Bingheng Wu

TL;DR

Wonderful Matrices addresses the challenge of building foundation models that are both efficient and effective by unifying sequence transformation (attention-like mechanisms) and state transformation (expert retrieval) into a single architecture. It introduces Rotary Position Embedding for hybrid algorithms, Dynamic Mask Attention to filter past states, and Cross Domain Mixture of Experts to reuse general and domain-specific knowledge, instantiated in the Cheems language model. The empirical evaluation shows improvements in perplexity on long sequences, robust associative recall, and favorable downstream metrics compared to baselines, with high parameter efficiency even at large expert counts. Together, these contributions offer a scalable, competitive foundation-model design for language modeling.

Abstract

To make foundation models more efficient and effective, we combine sequence transformation and state transformation. First, we prove that rotary position embedding is applicable to the state space duality algorithm, which reduces the perplexity of the hybrid of quadratic causal self-attention and state space duality by more than 4%, so that the combined sequence transformation uses a unified position encoding. Second, we propose dynamic mask attention, which maintains 100% accuracy on the more challenging multi-query associative recall task, an improvement of more than 150% over quadratic causal self-attention and state space duality, so that the combined sequence transformation selectively filters relevant information. Third, we design cross domain mixture of experts, which makes expert retrieval with more than 1024 experts 8 to 10 times faster than the mixture of experts, so that the combined state transformation retrieves the expert mixture quickly. Finally, we summarize these matrix algorithms, which together can form a foundation model: Wonderful Matrices, a potential competitor to popular model architectures.

Paper Structure

This paper contains 34 sections, 46 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Wonderful Matrices Architecture. Shows the matrices used in the Wonderful Matrices Architecture, including the Rotary Position Embedding Matrix, State Space Duality Matrix, Dynamic Mask Attention Matrix, Cross Domain Mixture of Experts Matrix, and the process of using these matrices.
  • Figure 2: Rotary Position Embedding. Shows the algorithm of Rotary Position Embedding. For an input with a sequence dimension and a hidden dimension, first add absolute position information $m$ to the $Q$ and $C$ matrices and absolute position information $n$ to the $K$ and $B$ matrices, then multiply the rotation matrices $\mathbb{R}_{\Theta, m}^{d}$ and $\mathbb{R}_{\Theta, n}^{d}$ with the $QK$ or $CB$ matrix to obtain the rotary position encoding matrix, and finally apply the mask matrix and output.
  • Figure 3: Dynamic Mask Attention. Shows the algorithm of Dynamic Mask Attention. The input is first projected to obtain $Q$, $K$, and $V$; the attention score is then computed from the $Q$ state and the concatenated past $K$ states; the causal mask is applied to the attention score; finally, a dynamic mask derived from the concatenated past $V$ states is applied to the score matrix, and the masked scores are applied to the $V$ states to produce the output.
  • Figure 4: Cross Domain Mixture of Experts. Shows the algorithm of Cross Domain Mixture of Experts. The input first passes through the query projection, then its dot product with the keys gives the affinity with each private expert; the top K private experts with the highest affinity are activated, and their output is finally mixed with the cross domain parameters to produce the output.
  • Figure 5: Wonderful Matrices in Language Modeling: Cheems. Shows the architecture of Wonderful Matrices applied to language modeling, including the Word Embedding, RMSNorm, Residual, RoPE, SSD, DMAttn, CDMoE, and LM Head modules. The black arrows indicate the order in which the modules are computed, the black dashed line marks the part that is stacked $7$ times, and the black solid line marks the entire backbone module, which is stacked $N$ times. The dog in the upper right corner is the internet-famous Shiba Inu Cheems, included as a bit of humor to let us relax and smile during strict formula derivation work. To keep the tables compact, subsequent experiments use Cheems as the model name.
  • ...and 5 more figures
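
The Figure 2 caption describes rotating position-tagged $Q$/$K$ (or $C$/$B$) states before taking their product. The following is a minimal standard-library sketch of that idea, not the authors' implementation: the helper names `rotate` and `dot`, the frequency base of 10000, and the toy vectors are all assumptions made for illustration. It checks the property that makes a shared position encoding possible: the score between a state rotated at position $m$ and one rotated at position $n$ depends only on the offset $m - n$.

```python
import math

def rotate(vec, pos, base=10000.0):
    """Rotate each consecutive 2-D pair of `vec` by an angle proportional to `pos`.

    `base` is the usual RoPE frequency base (an assumed default here).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)      # frequency of this 2-D pair
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query-like and key-like states (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

score_a = dot(rotate(q, 5), rotate(k, 2))     # offset m - n = 3
score_b = dot(rotate(q, 13), rotate(k, 10))   # same offset, shifted positions
score_c = dot(rotate(q, 5), rotate(k, 4))     # different offset, m - n = 1
```

Here `score_a` and `score_b` agree up to floating-point error, while `score_c` differs. Because the caption applies the same rotation to the $C$ and $B$ matrices, the same check would carry over to the state space duality branch with `q` and `k` read as rows of $C$ and $B$.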
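
The Figure 3 caption describes masking attention scores twice: once causally, and once with a mask derived from the past $V$ states. The sketch below illustrates that pattern under loudly stated assumptions. The gate vector `w_gate`, the "keep the top `keep` positions" rule, and all toy values are hypothetical stand-ins, since this page does not give the paper's actual mask formula.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]   # masked (-inf) entries become 0.0
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_mask_attention(Q, K, V, w_gate, keep):
    """Causal attention in which each query sees only the `keep` past positions
    whose value states score highest under `w_gate` (an assumed scoring rule)."""
    n, d = len(Q), len(Q[0])
    # Content-dependent importance of every position, computed from its V state.
    gate = [sum(v * w for v, w in zip(V[j], w_gate)) for j in range(n)]
    out = []
    for i in range(n):
        visible = range(i + 1)               # causal mask: positions j <= i
        kept = set(sorted(visible, key=lambda j: gate[j], reverse=True)[:keep])
        scores = [
            sum(a * b for a, b in zip(Q[i], K[j])) / math.sqrt(d)
            if j in kept else NEG_INF        # dynamic mask on top of the causal one
            for j in range(n)
        ]
        weights = softmax(scores)
        out.append([sum(weights[j] * V[j][c] for j in range(n))
                    for c in range(len(V[0]))])
    return out

# Toy states: position 0 carries the "important" value under w_gate.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[5.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
w_gate = [1.0, 1.0]

sparse = dynamic_mask_attention(Q, K, V, w_gate, keep=1)  # every row attends to position 0
dense = dynamic_mask_attention(Q, K, V, w_gate, keep=3)   # reduces to plain causal attention
```

With `keep=1` every query is forced onto the single most important past position, so each output row equals `V[0]`. With `keep` at least the sequence length, nothing is filtered and the function behaves like ordinary causal softmax attention.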
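
The Figure 4 caption describes a retrieval-style mixture: project the input to a query, score it against expert keys, activate the top K private experts, and mix their output with shared cross-domain parameters. The sketch below follows those steps under assumptions of its own: the shared path is a single matrix, each private expert is a `down` vector, a ReLU, and an `up` vector, and the gates come from a softmax over the selected affinities. None of these choices is confirmed by this page.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cdmoe(x, W_shared, W_query, keys, down, up, top_k):
    """Shared cross-domain path plus the top_k private experts (top_k >= 1)."""
    shared = matvec(W_shared, x)                 # applied to every input
    query = matvec(W_query, x)                   # query projection
    affinity = [dot(query, key) for key in keys]
    chosen = sorted(range(len(keys)), key=lambda e: affinity[e], reverse=True)[:top_k]
    gates = softmax([affinity[e] for e in chosen])
    out = list(shared)
    for g, e in zip(gates, chosen):
        act = max(0.0, dot(down[e], x))          # ReLU, an illustrative choice
        out = [o + g * act * u for o, u in zip(out, up[e])]
    return out, chosen

# Toy parameters (hypothetical): 2-D input, 4 private experts.
identity = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
down = [[1.0, 0.0]] * 4
up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]

y, chosen = cdmoe([1.0, 0.0], identity, identity, keys, down, up, top_k=2)
```

In this toy setup experts 3 and 0 have the highest affinity, so only their parameters are touched. The sketch scores every key linearly, so it illustrates the data flow only and says nothing about the retrieval speed reported in the abstract.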

Theorems & Definitions (1)

  • Proof: Proof of equation \ref{eq:rope_for_attn_ssd:score}