Learning Parameter Sharing with Tensor Decompositions and Sparsity

Cem Üyük; Mike Lasby; Mohamed Yassin; Utku Evci; Yani Ioannou

TL;DR

FiPS introduces a fine-grained parameter sharing framework based on sparse tensor decomposition to compress Vision Transformers and Large Language Models. By learning a shared basis $\mathbf{U}$ and sparse factors $\mathbf{V}$ such that $\mathbf{W}=\mathbf{U}\mathbf{V}$, FiPS lets neurons be shared across multiple MLP blocks, significantly reducing parameter counts while preserving accuracy and perplexity. The method proceeds in three stages: Shared Initialization, Local Error Minimization, and an optional Global Error Minimization. It uses truncated SVD for initialization and GMP/RigL-based sparse training. Across DeiT-B, Swin-L, Gemma-2, and Llama-3, FiPS achieves 40–75% parameter reduction with negligible accuracy loss and improved latency and memory use, which makes it practical for deployment on resource-constrained devices.

Abstract

Large neural networks exhibit exceptional performance across numerous tasks, yet their considerable size often hinders deployment on resource-constrained systems. While various model compression strategies have been well studied, parameter sharing remains underexplored. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a novel algorithm that leverages parameter sharing, tensor decomposition, and sparsity to effectively compress large-scale Vision Transformers (ViTs) and Large Language Models (LLMs). FiPS employs a shared base and sparse factors to represent neurons across multi-layer perceptron (MLP) modules, where initialization is guided by Singular Value Decomposition (SVD) and subsequent optimization is conducted through block-wise reconstruction error minimization. Experimental results show that FiPS reduces the parameter budget of MLP modules by 50–75% for DeiT-B and Swin-L and by 40–50% for various Gemma-2 and Llama-3 models, while keeping ViT accuracy within 1 percentage point of the original and degrading LLM perplexity only negligibly.
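The factorization $\mathbf{W}=\mathbf{U}\mathbf{V}$ with an SVD-guided initialization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the MLP weights of several blocks are concatenated along the column axis, that the singular values are folded into the shared basis, and that the factors are sparsified by one-shot magnitude pruning, which here stands in for the GMP/RigL sparse training named in the text. The function names (`shared_svd_init`, `magnitude_prune`, `relative_error`) and all dimensions are hypothetical.

```python
import numpy as np


def shared_svd_init(weights, rank):
    """Factor concatenated block weights as W ~= U @ V via truncated SVD.

    Assumption: each block weight has shape (d, p) and blocks are
    concatenated along columns, giving W of shape (d, k * p).
    """
    W = np.concatenate(weights, axis=1)
    U_full, S, Vt = np.linalg.svd(W, full_matrices=False)
    U = U_full[:, :rank] * S[:rank]   # shared basis, shape (d, rank)
    V = Vt[:rank, :]                  # per-neuron factors, shape (rank, k * p)
    return U, V


def magnitude_prune(V, sparsity):
    """Zero the smallest-magnitude entries of V.

    One-shot pruning used only as a simple stand-in for gradual
    (GMP) or dynamic (RigL) sparse training.
    """
    n_prune = int(sparsity * V.size)
    if n_prune == 0:
        return V.copy()
    magnitudes = np.abs(V).ravel()
    threshold = np.partition(magnitudes, n_prune - 1)[n_prune - 1]
    return np.where(np.abs(V) > threshold, V, 0.0)


def relative_error(W, U, V):
    """Relative Frobenius reconstruction error of U @ V against W."""
    return np.linalg.norm(W - U @ V) / np.linalg.norm(W)


# Toy setup with hypothetical sizes: 4 blocks, model dim 16, hidden dim 64.
rng = np.random.default_rng(0)
d, p, k, rank = 16, 64, 4, 8
weights = [rng.standard_normal((d, p)) for _ in range(k)]
W = np.concatenate(weights, axis=1)

U, V = shared_svd_init(weights, rank)
V_sparse = magnitude_prune(V, sparsity=0.5)

dense_params = W.size
shared_params = U.size + int(np.count_nonzero(V_sparse))
print(f"dense parameters:  {dense_params}")
print(f"shared parameters: {shared_params}")
print(f"relative error:    {relative_error(W, U, V_sparse):.3f}")
```

Random Gaussian weights are not low-rank, so the reconstruction error in this toy setup is large. The sketch only shows the bookkeeping: one basis `U` is stored once for all blocks, and only the nonzero entries of `V` count toward the parameter budget. The error minimization stages described above would then refine `U` and `V` against block outputs, which this sketch does not attempt.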
