SWSC: Shared Weight for Similar Channel in LLM

Binrui Zeng, Yongtao Tang, Xiaodong Liu, Xiaopeng Li

TL;DR

The paper addresses the deployment of large language models on resource-constrained platforms by introducing SWSC (Shared Weight for Similar Channel), a weight compression technique. SWSC runs channel-wise K-Means clustering and replaces the similar channel vectors in each cluster with a shared representative, reducing the storage of an $m \times m$ weight matrix from $m^2$ to $(k+1)m$, where $k$ is the number of clusters. To mitigate the accuracy loss caused by this approximation, it applies singular value decomposition to the weight error $W_{Err}=W-W'$ and keeps the top $r$ components as an error-correction term that is added back during inference. Experiments on Llama-2-7B with WikiText-2 show that SWSC often outperforms RTN quantization at low average bit widths, which supports its use for deploying LLMs more efficiently on constrained hardware.
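
As a rough worked example of the storage figure above, assume a square $m \times m$ weight matrix, $k$ representative vectors of length $m$, and one cluster index per channel. This breakdown is one plausible reading of the $(k+1)m$ count, and the values $m = 4096$ and $k = 511$ are illustrative assumptions, not numbers from the paper:

$$
\underbrace{k\,m}_{\text{representatives}} + \underbrace{m}_{\text{cluster indices}} = (k+1)\,m,
\qquad
\frac{m^2}{(k+1)\,m} = \frac{m}{k+1} = \frac{4096}{512} = 8 .
$$

Under these assumptions the clustered matrix holds 8 times fewer stored values than the original, before counting the extra values needed for the error-correction term.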

Abstract

Large language models (LLMs) have spurred development in multiple industries. However, their growing parameter counts bring substantial storage and computing burdens, making it essential to explore model compression techniques that reduce parameters and ease deployment. We propose SWSC, an LLM compression method based on the concept of Shared Weight for Similar Channel. It uses the K-Means clustering algorithm to cluster model weights channel by channel, producing clusters whose vectors are highly similar to one another. A representative vector from each cluster is selected to approximately replace all vectors in that cluster, significantly reducing the number of model weight parameters. However, this approximate restoration inevitably degrades the performance of the model. To tackle this issue, we perform singular value decomposition on the error between the weights before and after compression and retain the larger singular values and their corresponding singular vectors to compensate for the accuracy loss. The experimental results show that our method can effectively preserve the performance of the compressed LLM even under low-precision conditions.

Paper Structure (18 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Flowchart of SWSC Compression and Restoration
  • Figure 2: The process of clustering the weights of an LLM by channel and restoring them by taking the mean value.
  • Figure 3: Compression Error Compensation Process. $W$ and $W'$ are the original weight matrix of the LLM and its approximately restored counterpart, and $W_{Err}$ is the error matrix between them. $U$, $\Sigma$, and $V$ are the factors obtained from the singular value decomposition of $W_{Err}$. $W_{Err}'$ is the approximate error matrix, and $W_{New}$ is the matrix that is restored for the final inference.
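
The pipeline described in Figures 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: rows are treated as channels, a plain Lloyd's K-Means stands in for whatever K-Means variant the paper uses, the cluster mean serves as the representative, and the function names (`kmeans_rows`, `swsc_compress`, `swsc_restore`) are hypothetical.

```python
import numpy as np

def kmeans_rows(W, k, n_iter=20, seed=0):
    """Plain Lloyd's K-Means over the rows of W (rows assumed to be channels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct rows of W.
    centroids = W[rng.choice(W.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every row to every centroid, shape (m, k).
        d2 = ((W[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = W[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def swsc_compress(W, k, r):
    """Cluster the channels, then keep a rank-r SVD of the approximation error."""
    centroids, labels = kmeans_rows(W, k)
    W_approx = centroids[labels]             # W' : each channel -> its representative
    W_err = W - W_approx                     # W_Err = W - W'
    U, S, Vt = np.linalg.svd(W_err, full_matrices=False)
    A = U[:, :r] * S[:r]                     # U_r Sigma_r, shape (m, r)
    B = Vt[:r]                               # V_r^T, shape (r, n)
    return centroids, labels, A, B

def swsc_restore(centroids, labels, A, B):
    """W_New = W' + W_Err', with W_Err' the rank-r approximation of W_Err."""
    return centroids[labels] + A @ B

# Toy matrix: 64 channels drawn from 8 noisy prototypes (illustrative data only).
rng = np.random.default_rng(1)
prototypes = rng.normal(size=(8, 64))
W = prototypes[rng.integers(0, 8, size=64)] + 0.01 * rng.normal(size=(64, 64))

centroids, labels, A, B = swsc_compress(W, k=8, r=4)
err_plain = np.linalg.norm(W - centroids[labels])
err_comp = np.linalg.norm(W - swsc_restore(centroids, labels, A, B))
print(f"error without compensation: {err_plain:.4f}")
print(f"error with rank-4 compensation: {err_comp:.4f}")
```

Because the truncated SVD is the best rank-$r$ approximation of $W_{Err}$ in the Frobenius norm, adding the correction term cannot increase the reconstruction error. In this sketch the singular values are folded into `A`, so the correction stores $r(m+n)$ extra values on top of the clustered representation; how the paper accounts for that overhead is not stated in this summary.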