Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications

Yang Li, Daniel Agyei Asante, Changsheng Zhao, Ernie Chang, Yangyang Shi, Vikas Chandra

TL;DR

Basel introduces a targeted low-rank decomposition for pretrained LLMs, recasting weight matrices as linear combinations of orthonormal bases and relearning their importance on the target task. By pruning low-importance bases and adding new task-specific bases, Basel achieves deep compression with minimal accuracy loss, outperforming SVD and FWSVD on math reasoning and code generation. The approach also integrates with 8-bit quantization to further reduce size while preserving performance, and maintains practical inference gains in throughput and memory. It offers a scalable alternative to training from scratch, enabling efficient deployment of large models for specific applications. Overall, Basel demonstrates significant, task-aware model shrinking with strong empirical results across multiple model sizes and tasks.

Abstract

Large language models (LLMs) significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach to effectively compress these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the necessary elements for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression results on the Llama 2-7B and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining comparable accuracy to state-of-the-art low-rank compression techniques.
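The idea of writing a weight matrix as a linear combination of bases and then pruning some of them can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes the bases come from a plain SVD and, as a stand-in, uses the singular values themselves as the importance scores, whereas the method described above relearns importance on the target task and also adds new bases (neither step is shown). The function names `decompose`, `prune_bases`, and `reconstruct`, the matrix shape, and the choice `keep=8` are all hypothetical.

```python
import numpy as np

def decompose(W):
    """Write W as sum_i s[i] * outer(U[:, i], Vt[i, :]) using SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U, s, Vt

def prune_bases(U, s, Vt, keep):
    """Keep the `keep` bases with the largest importance.

    Assumption: importance is the singular value itself. A task-aware
    method would substitute scores learned on target-task data.
    """
    idx = np.argsort(s)[::-1][:keep]
    return U[:, idx], s[idx], Vt[idx, :]

def reconstruct(U, s, Vt):
    """Recombine the retained bases into a dense matrix."""
    return (U * s) @ Vt

# Toy stand-in for one weight matrix of a pretrained model.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))

U, s, Vt = decompose(W)
W_full = reconstruct(U, s, Vt)            # all bases: recovers W

Uk, sk, Vtk = prune_bases(U, s, Vt, keep=8)
W_low = reconstruct(Uk, sk, Vtk)          # rank-8 approximation of W

# Storage cost: dense matrix versus the retained factors.
params_dense = W.size
params_low = Uk.size + sk.size + Vtk.size
```

Storing the retained factors instead of the dense matrix is where the size reduction comes from: the cost drops from `m * n` values to roughly `keep * (m + n)`.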

Paper Structure (13 sections, 6 equations, 14 figures, 2 tables, 1 algorithm)

This paper contains 13 sections, 6 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Basel: Identify and select the important bases for target applications during compression.
  • Figure 2: An interpretation of the role of bases from the perspective of signal processing.
  • Figure 3: Pass@1 accuracy and model size of Llama 2-7B compressed with various low-rank algorithms on the mathematical reasoning task.
  • Figure 4: Pass@1 accuracy and model size of Llama 2-13B compressed with various low-rank algorithms on the mathematical reasoning task.
  • Figure 5: Pass@1 accuracy and model size of Llama 2-7B compressed with various low-rank algorithms on the code generation task.
  • ...and 9 more figures