Wonderful Matrices: More Efficient and Effective Architecture for Language Modeling Tasks

Jingze Shi, Bingheng Wu, Lu He, Luchang Jiang

TL;DR

The availability of inner product form position encoding in the state space dual algorithm is proved, and the effectiveness of different position embeddings in hybrid quadratic causal self-attention and state space dual algorithms is studied.

Abstract

We prove the availability of inner product form position encoding in the state space dual algorithm and study the effectiveness of different position embeddings in hybrid quadratic causal self-attention and state space dual algorithms. We propose inner function attention with dynamic mask, which improves the expressiveness of the attention algorithm and prevents sequence noise from significantly affecting the accuracy of the attention scores. We also design cross domain mixture of experts, which improves the granularity of the sparsely activated feedforward network while maintaining the efficiency of parameter utilization and retrieval. The combination of these methods constitutes our foundation model architecture: Wonderful Matrices. We conduct experiments on the language modeling task and find that Wonderful Matrices is more efficient and effective in handling complex language tasks.

Paper Structure (40 sections, 52 equations, 8 figures, 7 tables)

This paper contains 40 sections, 52 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Wonderful Matrices Architecture. Shows the matrices used in the Wonderful Matrices architecture, including the rotary position embedding matrix, state space duality matrix, quadratic causal self-attention matrix, cross domain mixture of experts matrix, and the process of using these matrices. The specific structure and algorithm of these matrices will be detailed in subsequent chapters.
  • Figure 2: (Left) RoPE for Hybrid Algorithms. Shows the algorithm matrix of the rotary position embedding in inner product form. Color depth indicates the position carried by the position encoding. The input tensor is first multiplied by the $QK$ or $CB$ matrices, then the sine and cosine position information is attached, and finally the scalar relative position matrix is obtained through the inner product operation. (Right) Inner Function Attention. Shows the structure and algorithm of inner function attention. The input tensor is first multiplied by $QK$ to obtain the query and key matrices, then the scalar attention scores are calculated, and finally the attention scores are applied to the value states computed by the inner function to produce the output.
  • Figure 3: CDMoE. Shows the internal structure and calculation process of the cross domain million mixture of experts matrix. Input tensors first pass through the shared cross domain parameters, then through a linear layer whose output is reshaped into queries. The dot product of the queries with the keys gives the affinity with each private expert, and finally the tensors carrying shared knowledge are mixed through the top K private experts with the highest affinity.
  • Figure 4: Wonderful Matrices in Language Modeling: Cheems. Shows the architecture of Wonderful Matrices applied to language modeling, including the word embeddings, RMSNorm, Add (residual connection), RoPE, SSD, InnerFuncAttn, CDMoE, and LM Head modules. The black arrows indicate the calculation order of the modules, the black dashed lines indicate that the enclosed part is stacked $7$ times, and the black solid lines indicate that the entire backbone module is stacked $N$ times. The dog in the upper right corner is the internet-famous Shiba Inu Cheems, a touch of humor that lets us relax and smile amid strict formula derivation work. To keep some of the tables tidy, subsequent experiments use Cheems as the model name.
  • Figure 5: Multi-Query Associative Recall. We introduce a more difficult version of the original multi-query associative recall task (Arora et al., 2024), including longer sequence lengths, smaller model dimensions, etc. For detailed parameters, see the appendix on evaluation parameters for multi-query associative recall. InnerFuncAttn with Dynamic Mask maintains good performance in most cases.
  • ...and 3 more figures
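
The routing steps described for Figure 3 can be sketched as a toy example. This is only an illustration under stated assumptions, not the paper's implementation: the function `cdmoe_route`, all weight names, the tiny dimensions, the single (non-reshaped) query, and the softmax gating over the selected experts are hypothetical choices made here for clarity.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cdmoe_route(x, W_shared, W_query, keys, experts, top_k):
    """Toy routing step; every name and shape here is hypothetical."""
    # 1. Shared ("cross domain") parameters are applied to every input.
    h = matvec(W_shared, x)
    # 2. A linear layer turns the shared state into a routing query.
    q = matvec(W_query, h)
    # 3. Dot product with each key gives the affinity with that private expert.
    affinity = [sum(a * b for a, b in zip(q, key)) for key in keys]
    # 4. Keep the top-k private experts with the highest affinity.
    order = sorted(range(len(keys)), key=lambda i: affinity[i], reverse=True)
    chosen = order[:top_k]
    # Assumption: softmax gating over the selected affinities.
    weights = softmax([affinity[i] for i in chosen])
    # 5. Mix the shared state through the chosen private experts.
    out = [0.0] * len(h)
    for w, i in zip(weights, chosen):
        y = matvec(experts[i], h)
        out = [o + w * v for o, v in zip(out, y)]
    return out, chosen, weights


def rand_matrix(rows, cols):
    """Random Gaussian matrix as a list of rows."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d, d_q, n_experts = 4, 3, 8
x = [random.gauss(0.0, 1.0) for _ in range(d)]
W_shared = rand_matrix(d, d)
W_query = rand_matrix(d_q, d)
keys = rand_matrix(n_experts, d_q)
experts = [rand_matrix(d, d) for _ in range(n_experts)]

out, chosen, weights = cdmoe_route(x, W_shared, W_query, keys, experts, top_k=2)
print(chosen, weights)
```

Only `top_k` of the `n_experts` private experts are evaluated per input, while the shared transform runs for every input. That split is the behavior the caption describes.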

Theorems & Definitions (1)

  • Proof: Proof of the equation labeled eq:rope_cb
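
The proof itself is not reproduced in this listing. As a hedged reminder, the standard rotary embedding identity that an inner product form position encoding relies on can be written for a single two-dimensional block as follows. The symbols $R(m)$, $\theta$, $q$, and $k$ are notation assumed here, not necessarily the paper's.

$$
R(m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix},
\qquad
\langle R(m)\,q,\; R(n)\,k \rangle = q^{\top} R(m)^{\top} R(n)\, k = q^{\top} R(n-m)\, k .
$$

Because $R(m)^{\top} = R(-m)$ and rotations compose additively, the score depends only on the relative offset $n - m$. In the state space dual setting one would presumably substitute the $C$ and $B$ projections for $q$ and $k$, consistent with the $QK$ or $CB$ pairing described for Figure 2.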