PanGu-$π$: Enhancing Language Model Architectures via Nonlinearity Compensation

Yunhe Wang; Hanting Chen; Yehui Tang; Tianyu Guo; Kai Han; Ying Nie; Xutao Wang; Hailin Hu; Zheyuan Bai; Yun Wang; Fangcheng Liu; Zhicheng Liu; Jianyuan Guo; Sinan Zeng; Yinchen Zhang; Qinghua Xu; Qun Liu; Jun Yao; Chao Xu; Dacheng Tao

TL;DR

PanGu-π addresses feature collapse in Transformer-based LLMs by injecting nonlinearity through two modules: a Series Informed Activation Function in the feed-forward network (FFN) and Augmented Shortcuts in the multi-head self-attention (MSA) module. Theoretical bounds and ablation studies show that these components jointly enhance nonlinear expressive power, enabling PanGu-π-7B to match or exceed state-of-the-art baselines with improved efficiency, and PanGu-π-1B to perform on par with larger models. The domain-specialized LLM YunShan demonstrates the approach's practicality in finance and law, delivering superior benchmark results via domain-focused pretraining, tokenizer expansion, and instruction tuning. Overall, the work highlights nonlinearity as a core driver of expressive capacity in LLMs and offers a scalable architecture for both general and domain-specific NLP tasks.
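To make the idea of an augmented shortcut concrete, here is a minimal pure-Python sketch. It assumes the block output takes the form MSA(Z) + Z + Σ_i σ(Z W_i), that is, the usual identity shortcut plus extra parameterized, nonlinear projections of each token. This form, the single-head attention with identity Q/K/V projections, the choice of GELU for σ, and every function name below are illustrative assumptions, not details stated on this page.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gelu(x):
    """Exact GELU, used here as the assumed shortcut nonlinearity."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def self_attention(z):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    d = len(z[0])
    z_t = [list(col) for col in zip(*z)]
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(z, z_t)]
    return matmul(softmax_rows(scores), z)


def augmented_block(z, shortcut_weights):
    """Return attention(z) + z + sum_i gelu(z @ W_i) (assumed form).

    z: N x d token features; shortcut_weights: list of d x d matrices.
    """
    attn = self_attention(z)
    # Identity shortcut: copies the input features unchanged.
    out = [[a + x for a, x in zip(ra, rz)] for ra, rz in zip(attn, z)]
    # Augmented shortcuts: per-token projections followed by a nonlinearity.
    for w in shortcut_weights:
        proj = matmul(z, w)
        out = [[o + gelu(p) for o, p in zip(ro, rp)] for ro, rp in zip(out, proj)]
    return out
```

With an empty `shortcut_weights` list the function reduces to a plain residual attention block, which makes it easy to compare outputs with and without the extra paths.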

Abstract

The recent trend in large language models (LLMs) is to scale up both model size (i.e., the number of parameters) and dataset size to achieve better generative ability, as demonstrated by well-known work such as GPT and Llama. However, large models often incur massive computational costs that practical applications cannot afford, and methods for constructing a strong model architecture for LLMs are rarely discussed. We first analyze state-of-the-art language model architectures and observe the feature collapse problem. Based on theoretical analysis, we propose that nonlinearity, which is usually studied in convolutional neural networks for vision tasks, is also very important for language models. A series informed activation function is then introduced at negligible computational cost, and an augmented shortcut is further used to enhance model nonlinearity. Through carefully designed ablations, we demonstrate that the proposed approach significantly enhances model nonlinearity; we thus present a new efficient model architecture for modern LLMs, namely PanGu-$π$. Experiments are conducted using the same dataset and training strategy to compare PanGu-$π$ with state-of-the-art LLMs. The results show that PanGu-$π$-7B achieves performance comparable to that of benchmark models with about a 10% inference speed-up, and PanGu-$π$-1B achieves state-of-the-art performance in terms of accuracy and efficiency. In addition, we have deployed PanGu-$π$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application. The results show that YunShan surpasses other models of similar scale on benchmarks.
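As an illustration of why a series-style activation can add nonlinearity cheaply, the sketch below assumes the series form σ_s(x) = Σ_i a_i σ(x + b_i), where several scaled and shifted copies of a base activation are summed. This formula, the parameter names `scales` and `biases`, and the function names are assumptions made for illustration; the abstract itself only states that the function adds negligible computation.

```python
import math


def gelu(x):
    """Exact GELU, used as the default base activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    """Rectified linear unit, an alternative base activation."""
    return max(0.0, x)


def series_activation(x, scales, biases, base=gelu):
    """Sum of scaled, shifted copies of a base activation (assumed form).

    Computes sum_i scales[i] * base(x + biases[i]). With a single term,
    scale 1 and bias 0, this reduces to the base activation itself.
    """
    if len(scales) != len(biases):
        raise ValueError("scales and biases must have the same length")
    return sum(a * base(x + b) for a, b in zip(scales, biases))


def series_activation_vector(xs, scales, biases, base=gelu):
    """Apply the series activation elementwise to a list of values."""
    return [series_activation(x, scales, biases, base) for x in xs]
```

Each extra term costs one shift, one base-activation call, and one multiply-add per element, which is small next to the matrix multiplications of an FFN layer. With a ReLU base, every additional bias introduces another kink in the resulting piecewise-linear function.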

Paper Structure (29 sections, 13 theorems, 50 equations, 6 figures, 10 tables)

Introduction
Related Works
LLMs
Enhanced Transformer Architectures
LLMs for Finance and Law
Preliminaries and Motivation
PanGu-$\pi$ Modules and Architectures
Augmented Shortcut
Series Informed Activation Function
Combination
Experiments on General Field
Ablation Studies
Feature Analysis and Visualization
Comparison with 7B Models
Comparison with 1B Models
...and 14 more sections

Key Result

Lemma 1

For a self-attention matrix $\boldsymbol{A} \in \mathbb{R}^{N\times N}$, any weight matrix $\boldsymbol{W} \in \mathbb{R}^{d\times m}$, any $\boldsymbol{H},\boldsymbol{B}\in \mathbb{R}^{N\times d}$, $\alpha_1, \alpha_2 \geq 0$, and a nonlinear Lipschitz continuous activation function $\sigma$, we have [...], where $s$ is the largest singular value of $\boldsymbol{W}$, $\lambda_{\max}$ is the largest eigenv

Figures (6)

Figure 1: Statistics of domain specialized LLMs. The general LLMs face challenges in supporting industry applications, leading to a growing emphasis on domain specialized LLMs. Among them, the fields of finance and law are particularly active.
Figure 2: The diagram of the proposed PanGu-$\pi$ architecture. The series activation function is adapted to FFN, and the augmented shortcuts are integrated into MSA, which effectively introduces more nonlinearity into the Transformer architecture.
Figure 3: The diagram of MSA module equipped with augmented shortcuts, where different patterns (rectangle, triangle, etc.) denote different features from various tokens. The original identity shortcut copies the input feature while the augmented shortcuts (Aug-S) project features of each input token to diverse representations.
Figure 4: The effective dimension $d(0.8)$ across layers of different model architectures. A larger number of effective dimensions means more principal components are needed to account for 80% of variance, indicating more diversity in feature channels.
Figure 5: Visualization of hidden states from each layer. The most frequent five tokens are highlighted using different colors for visualization. The total variance accounted for by the first three principal components is labeled for each layer on the top. Note that the beginning tokens are removed from the analysis because they are considered outliers.
...and 1 more figure
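The effective dimension $d(0.8)$ in the Figure 4 caption can be read as the smallest number of principal components whose cumulative explained variance reaches 80%. The sketch below computes that count from a list of PCA eigenvalues (component variances). The function name, the input format, and the error handling are assumptions for illustration; how the eigenvalues are obtained from hidden states is outside this sketch.

```python
def effective_dimension(eigenvalues, threshold=0.8):
    """Smallest k such that the top-k eigenvalues explain >= threshold of variance.

    eigenvalues: non-negative variances of the principal components, in any order.
    threshold: fraction of total variance to account for, in (0, 1].
    """
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    vals = sorted(eigenvalues, reverse=True)
    if not vals or any(v < 0 for v in vals):
        raise ValueError("eigenvalues must be a non-empty list of non-negative numbers")
    total = sum(vals)
    if total <= 0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, v in enumerate(vals, start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return k
    # Only reachable through floating-point rounding when threshold is close to 1.
    return len(vals)
```

A spectrum dominated by one component gives a small value, which corresponds to collapsed features, while a flat spectrum gives a value close to the number of channels.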

Theorems & Definitions (26)

Lemma 1
Theorem 1
Lemma 2
Theorem 2
Theorem 3
Theorem 4
Lemma 3
Lemma 4
Theorem 5
Theorem 6
...and 16 more
