Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

TL;DR

MASA (Matrix Atom Sharing in Attention) uses matrix-based dictionary learning to share attention projections across transformer layers, reducing attention parameters by 66.7% while maintaining performance. It decomposes the Q, K, V, and O projection matrices into shared dictionary atoms with per-layer coefficients, giving a plug-and-play approach that works when training from scratch and extends to pretrained models via Matrix PCA, grouping, and data-aware local refinement. Across language and vision tasks, MASA matches or surpasses existing compression methods (GQA, low-rank, Sequential and Repeat-all-over sharing) at multiple scales, and it can be applied to pretrained LLMs with training-free adaptation. The framework offers a scalable path to parameter-efficient Transformers.

Abstract

Large language models (LLMs) have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g. low-rank approximation, attention head pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in CNNs, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as linear combinations of shared matrix atoms. Experiments across scales (100M-700M parameters) show that MASA achieves better benchmark accuracy and perplexity than grouped-query attention (GQA), low-rank baselines and recently proposed Repeat-all-over/Sequential sharing at comparable parameter budgets. Ablation studies confirm robustness to the dictionary size and the efficacy of shared representations in capturing cross-layer statistical regularities. Extending to Vision Transformers (ViT), MASA matches performance metrics on image classification and detection tasks with 66.7% fewer attention parameters. By combining dictionary learning strategies with transformer efficiency, MASA offers a scalable blueprint for parameter-efficient models without sacrificing performance. Finally, we investigate the possibility of employing MASA on pretrained LLMs to reduce their number of parameters without experiencing any significant drop in their performance.
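The 66.7% figure can be sanity-checked with simple parameter counting. The sketch below is illustrative only and rests on assumptions not stated in the abstract: square `d_model x d_model` projections, no bias terms, and a dictionary with one third as many atoms as there are layers. The function name `attention_param_counts` and the chosen sizes are hypothetical.

```python
def attention_param_counts(num_layers, d_model, num_atoms):
    """Count Q/K/V/O projection parameters with and without atom sharing.

    Assumes square d_model x d_model projections and no biases (illustrative).
    """
    per_matrix = d_model * d_model
    num_proj = 4  # Q, K, V, O
    # Baseline: every layer owns its own four projection matrices.
    baseline = num_proj * num_layers * per_matrix
    # Shared: num_atoms matrix atoms per projection type, plus one
    # coefficient per (layer, atom) pair.
    shared = num_proj * (num_atoms * per_matrix + num_layers * num_atoms)
    return baseline, shared


# Hypothetical configuration: 12 layers, width 768, 4 atoms (= 12 / 3).
baseline, shared = attention_param_counts(num_layers=12, d_model=768, num_atoms=4)
reduction = 1.0 - shared / baseline
print(f"baseline={baseline:,} shared={shared:,} reduction={reduction:.1%}")
```

Under these assumptions the per-layer coefficients add only `num_layers * num_atoms` scalars per projection type, so the reduction is governed almost entirely by the ratio of atoms to layers.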

Paper Structure

This paper contains 40 sections, 12 equations, 44 figures, 10 tables, 1 algorithm.

Figures (44)

  • Figure 1: MASA framework: (Left) Independent dictionary pools for Q, K, V, O projections. (Middle) Per-block projection matrices synthesized via weighted combinations of shared dictionaries (example: Block l). All blocks share dictionary pools while using unique linear coefficients for each Transformer block.
  • Figure 2: Evaluation results of different ViT models trained from scratch on the CIFAR100 training set. The solid blue line shows the Top-1 accuracy of the vanilla attention models, the solid green line shows the Top-1 accuracy of MASA, and the dotted lines show the parameter counts of the respective full models.
  • Figure 11: Evaluation results of different ViT models trained from scratch on the CIFAR10 training set. The solid blue line shows the Top-1 accuracy of the vanilla attention models, the solid green line shows the Top-1 accuracy of MASA, and the dotted lines show the parameter counts of the respective full models.
  • Figure 12: Evaluation results of different ViT models trained from scratch on the TinyImageNet training set. The solid blue line shows the Top-1 accuracy of the vanilla attention models, the solid green line shows the Top-1 accuracy of MASA, and the dotted lines show the parameter counts of the respective full models.
  • Figure : (A) $D^Q$
  • ...and 39 more figures
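
The synthesis step described in the Figure 1 caption (independent dictionary pools for Q, K, V, O, with unique linear coefficients per block) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the toy sizes, the random initialization, and the names `dictionaries`, `coefficients`, and `synthesize` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_atoms, d_model = 6, 2, 8  # toy sizes, chosen for illustration

# One independent dictionary pool per projection type, shared by all blocks.
dictionaries = {
    name: rng.standard_normal((num_atoms, d_model, d_model)) for name in "QKVO"
}
# Per-block coefficients: one length-num_atoms vector per block and projection.
coefficients = {
    name: rng.standard_normal((num_layers, num_atoms)) for name in "QKVO"
}


def synthesize(name, layer):
    """Return block `layer`'s projection as a weighted sum of shared atoms."""
    # W[i, j] = sum over atoms s of coeff[s] * atom[s, i, j]
    return np.einsum("s,sij->ij", coefficients[name][layer], dictionaries[name])


W_q = synthesize("Q", layer=3)
print(W_q.shape)
```

In a trained model both the atoms and the coefficients would be learned parameters; here they are random placeholders so that the block runs on its own.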