Learning Parameter Sharing with Tensor Decompositions and Sparsity

Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou

TL;DR

FiPS introduces a fine-grained parameter sharing framework based on sparse tensor decomposition to compress Vision Transformers and Large Language Models. By learning a shared basis $\mathbf{U}$ and sparse factors $\mathbf{V}$ such that $\mathbf{W}=\mathbf{U}\mathbf{V}$, FiPS enables neurons to be shared across multiple MLP blocks, significantly reducing parameter counts while preserving accuracy and perplexity. The method proceeds in three stages: Shared Initialization, Local Error Minimization, and optional Global Error Minimization, using truncated SVD initialization and GMP/RigL-based sparse training. Across DeiT-B, Swin-L, Gemma-2, and Llama-3, FiPS achieves a 40–75% parameter reduction with negligible accuracy loss and favorable latency and memory improvements, which makes it practical for deployment on resource-constrained devices.

Abstract

Large neural networks exhibit exceptional performance across numerous tasks, yet their considerable size often hinders deployment on resource-constrained systems. While various model compression strategies have been well studied, parameter sharing remains underexplored. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a novel algorithm that leverages parameter sharing, tensor decomposition, and sparsity to effectively compress large-scale Vision Transformers (ViTs) and Large Language Models (LLMs). FiPS employs a shared basis and sparse factors to represent neurons across multi-layer perceptron (MLP) modules, where initialization is guided by Singular Value Decomposition (SVD) and subsequent optimization is conducted through block-wise reconstruction error minimization. Experimental results show that FiPS reduces the parameter budget of MLP modules by 50–75% for DeiT-B and Swin-L and by 40–50% for various Gemma-2 and Llama-3 models, while keeping ViT accuracy within 1 percentage point of the original and degrading LLM perplexity only negligibly.
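The shared-basis factorization described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `shared_sparse_factorization`, the choice to concatenate the weight matrices along their column dimension, and the decision to fold the singular values into $\mathbf{U}$ are assumptions made for illustration. One-shot magnitude pruning stands in for the GMP/RigL-based sparse training, and the reconstruction-error minimization stages are omitted entirely.

```python
import numpy as np


def shared_sparse_factorization(weights, rank, sparsity):
    """Factor several weight matrices as W_i ~= U @ V_i with one shared U.

    weights  : list of arrays of shape (d, n_i), all with the same d
    rank     : number of columns kept in the shared basis U
    sparsity : fraction of entries of the stacked V to set to zero
    Hypothetical helper for illustration only.
    """
    # Stack the matrices side by side so that a single basis spans all of them.
    W = np.concatenate(weights, axis=1)                  # (d, sum of n_i)

    # Truncated SVD: keep the leading `rank` singular triplets.
    U_full, S, Vt = np.linalg.svd(W, full_matrices=False)
    U = U_full[:, :rank] * S[:rank]                      # fold singular values into U
    V = Vt[:rank, :]                                     # (rank, sum of n_i)

    # One-shot magnitude pruning of V (a stand-in for GMP/RigL training).
    k = int(round(sparsity * V.size))
    if k > 0:
        threshold = np.partition(np.abs(V).ravel(), k - 1)[k - 1]
        V = np.where(np.abs(V) > threshold, V, 0.0)

    # Split V back into one sparse factor per original matrix.
    boundaries = np.cumsum([w.shape[1] for w in weights])[:-1]
    return U, np.split(V, boundaries, axis=1)


def mean_squared_error(U, factors, weights):
    """Mean squared reconstruction error over all matrices."""
    errors = [np.mean((U @ V - W) ** 2) for V, W in zip(factors, weights)]
    return float(np.mean(errors))
```

With the full rank and zero sparsity the factorization reproduces the inputs exactly; lowering the rank or raising the sparsity trades reconstruction error for parameter count, which is the gap the error-minimization stages are meant to close.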

Paper Structure

This paper contains 46 sections, 2 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Fine-grained Parameter Sharing (FiPS) employs a shared basis for the sparse weight matrices in the Fully Connected (FC) layers of Multi-Layer Perceptron (MLP) modules across transformer blocks. This approach is detailed in \ref{sec:ParamSharing} and \ref{sec:method}.
  • Figure 2: Initial Experiments. Reconstruction error when inducing sparsity on different factors of the low-rank decomposition for (\ref{fig:MLP1_MSE}) FC-1 and (\ref{fig:MLP2_MSE}) FC-2 layers under a 25% parameter budget. Higher sparsity in larger factors ($\mathbf{V}$) enables a higher rank and lower reconstruction error. (\ref{fig:concat_dims}) Mean reconstruction error across four FC layers under various parameter sharing schemes. See \ref{sec:WhichDimsToShare} for a description of each strategy.
  • Figure 3: Parameter Sharing Groups. (\ref{fig:2d_heat} top) Mean squared error (MSE) increases when sharing $\mathbf{U}$ across different MLP modules, with red squares indicating that sharing adjacent modules enhances reconstruction. (\ref{fig:2d_heat} bottom) MSE for compressing individual MLP modules, demonstrating that sharing $\mathbf{U}$ among consecutive layers typically results in the lowest error. (\ref{fig:dense_vs_sparse_rank}) For a fixed parameter budget, the rank of the shared basis $\mathbf{U}$ stabilizes around four MLP modules, aligning with the optimal group size (\ref{fig:dense_vs_sparse_accuracy}) for maximizing accuracy in the DeiT-B model.
  • Figure 4: DeiT-B Ablations and Sensitivity Analysis. (\ref{fig:methods_ablation}) Analysis of key components in the algorithm, including Random Initialization (RI), Local Pruning (LP), Global Pruning (GP), and Scaling Vectors (SV); (\ref{fig:sparsity_sweep}) Impact of varying sparsity levels on performance; (\ref{fig:epoch_batch_sweep}) Influence of the calibration dataset size and the number of training epochs on post-compression accuracy using a batch size of 128.
  • Figure 5: DeiT-B Global Sparsity Analysis. (\ref{fig:sparsity_dist}) Average sparsity at the end of training shows that more parameters are allocated to later modules. (\ref{fig:pearson_corr}) A strong correlation is observed between the MSE reported in \ref{fig:2d_heat} and the parameter distribution identified by FiPS.
  • ...and 1 more figure
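The Figure 2 caption notes that higher sparsity in the larger factor $\mathbf{V}$ allows a higher rank under a fixed parameter budget. A rough way to see why, assuming that only the dense entries of $\mathbf{U}$ and the nonzero entries of $\mathbf{V}$ count toward the budget (index storage for the sparse format is ignored), is the sketch below. The helper name `max_rank_for_budget` and the example dimensions are illustrative assumptions, not values taken from the paper's experiments.

```python
def max_rank_for_budget(d, n, budget, sparsity):
    """Largest rank r such that d*r + (1 - sparsity)*r*n <= budget * d * n.

    d, n     : shape of the (stacked) dense weight matrix W, with W ~= U @ V,
               U of shape (d, r) kept dense and V of shape (r, n) sparse
    budget   : fraction of the dense parameter count that may be kept
    sparsity : fraction of zero entries in V
    Hypothetical helper; ignores sparse-index overhead.
    """
    allowed_params = budget * d * n
    cost_per_rank = d + (1.0 - sparsity) * n   # one column of U plus the nonzeros in one row of V
    rank = int(allowed_params // cost_per_rank)
    return min(rank, min(d, n))                # the rank cannot exceed the matrix dimensions


# Illustrative shape: a 768-wide layer stacked over four 3072-wide FC layers.
dense_v_rank = max_rank_for_budget(768, 4 * 3072, budget=0.25, sparsity=0.0)    # 180
sparse_v_rank = max_rank_for_budget(768, 4 * 3072, budget=0.25, sparsity=0.75)  # 614
```

Under these assumptions, making $\mathbf{V}$ 75% sparse more than triples the affordable rank at the same 25% budget, which matches the direction of the trend the caption describes.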