Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models

Chakshu Moar, Faraz Tahmasebi, Michael Pellauer, Hyoukjun Kwon

TL;DR

The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent and real-time coding assistant), where the latency is as important as the model accuracy.

Abstract

Recent large language models (LLMs) employ billions of parameters to enable broad problem-solving capabilities. Such language models also tend to be memory-bound because they are dominated by matrix-vector and matrix-matrix multiplications with low arithmetic intensity. Optimizing the memory footprint and traffic is therefore an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored for this purpose. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not yet well understood. In this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, specifically Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is enormous (e.g., $O(2^{39})$ for Llama2-7B). To navigate such a vast design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, ranging from 4%p to 10%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. (Here, %p denotes "percentage point," the absolute difference between two percentages; for example, 74% -> 78% is a 4%p increase.) The results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent and real-time coding assistant), where the latency is as important as the model accuracy.
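The claim that matrix-vector and matrix-matrix multiplications can have low arithmetic intensity can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the paper: the function name `arithmetic_intensity`, the 2-byte (fp16) element size, the batch of 512 tokens, and the idealized assumption that every operand is moved exactly once are all assumptions; the 4096 dimension matches the original rank quoted in the Figure 5 caption.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes each operand is read once and the result is written once
    (an idealized traffic model, not a measurement).
    """
    flops = 2 * m * k * n  # one multiply and one add per multiply-accumulate
    traffic_bytes = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic_bytes


# Matrix-vector product: a 4096 x 4096 weight applied to one token vector.
gemv = arithmetic_intensity(4096, 4096, 1)

# Matrix-matrix product: the same weight applied to 512 token vectors at once.
gemm = arithmetic_intensity(4096, 4096, 512)

print(f"matrix-vector: {gemv:.2f} FLOP/byte")
print(f"matrix-matrix: {gemm:.2f} FLOP/byte")
```

Under these assumptions the matrix-vector case performs roughly one floating-point operation per byte moved, because the weight matrix dominates the traffic and is used only once. This is why shrinking the weights reduces traffic, and with it latency, for memory-bound workloads.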

Paper Structure

This paper contains 21 sections, 1 theorem, 28 equations, 11 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Decomposition Design Space Size ($|S_{LR}|$). For a given model $m$ and its decomposition design space ($S_{LR}(m)$),

Figures (11)

  • Figure 1: An illustration of Tucker decomposition. A three-dimensional tensor $T$ can be decomposed into one core tensor and three factor matrices, $U^{1}$, $U^{2}$, and $U^{3}$. The dimensions of the core tensor correspond to the rank of the decomposition.
  • Figure 2: The layer architecture of two recent language models: BERT [kenton2019bert] and Llama 2 [touvron2023llama]. Lin, BMM, and RMS refer to the linear layer, batched matrix multiplication, and root mean square, respectively. We highlight decomposable weight tensors using yellow boxes.
  • Figure 3: A high-level example roofline that captures whether workloads are memory-bound or compute-bound based on their operational intensity. BW, C, and M indicate bandwidth, compute, and memory, respectively. The example execution timelines in (b) and (c) show how the compute and memory latencies overlap across compute tiles in memory-bound and compute-bound scenarios.
  • Figure 4: An illustration of the three axes of the decomposition configurations discussed in the decomposition-configuration definition (def:decomp_config). (a): Choice of the layers to decompose, (b): Choice of tensors within each layer to decompose, (c): Choice of the pruned rank (PR) to be used for each decomposed tensor. "Lin." refers to the linear layer.
  • Figure 5: Impact of rank on accuracy. We prune ranks from the original (4096) to 500, 250, and 1. By pruned rank (PR), we refer to the remaining rank after rank pruning. The accuracy with no decomposition is based on the accuracy reported in the original Llama 2 publication [touvron2023llama].
  • ...and 6 more figures
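
The captions of Figures 1 and 5 describe splitting a tensor into a core and factor matrices and then keeping only a pruned rank (PR). A minimal sketch of that idea for a two-dimensional weight follows. It assumes the factors are obtained from a truncated SVD and that the core and both factor matrices are stored separately; the paper's actual procedure and storage layout may differ. The helper names `tucker2_from_svd` and `tucker2_param_count`, the synthetic 16 x 12 weight, and the 4096 x 4096 shape are hypothetical and chosen for illustration.

```python
import numpy as np


def tucker2_from_svd(weight: np.ndarray, rank: int):
    """Rank-pruned Tucker form of a 2-D weight: weight ~= u1 @ core @ u2.T.

    Assumption: a truncated SVD supplies the factors; this is one common
    way to build the decomposition, not necessarily the paper's.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    u1 = u[:, :rank]          # factor matrix for the row mode, shape (m, rank)
    core = np.diag(s[:rank])  # core tensor, shape (rank, rank)
    u2 = vt[:rank, :].T       # factor matrix for the column mode, shape (n, rank)
    return u1, core, u2


def tucker2_param_count(m: int, n: int, rank: int) -> int:
    """Parameters kept when the core and both factors are stored separately."""
    return m * rank + rank * rank + n * rank


# Synthetic weight with exact rank 3, so pruning to rank 3 loses nothing.
rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 3)) @ rng.standard_normal((3, 12))
u1, core, u2 = tucker2_from_svd(weight, rank=3)
rel_error = np.linalg.norm(weight - u1 @ core @ u2.T) / np.linalg.norm(weight)
print(f"relative reconstruction error at rank 3: {rel_error:.2e}")

# Parameter counts for a hypothetical 4096 x 4096 tensor at the PRs in Figure 5.
dense = 4096 * 4096
for pr in (500, 250, 1):
    kept = tucker2_param_count(4096, 4096, pr)
    print(f"PR={pr:>3}: {kept:>9,} parameters ({kept / dense:.2%} of dense)")
```

For a matrix, the three factor matrices of Figure 1 reduce to two. Real model weights are not exactly low-rank, so pruning the rank introduces approximation error; that error is the source of the accuracy drops the paper characterizes.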

Theorems & Definitions (2)

  • Theorem 1
  • proof