PC-LoRA: Low-Rank Adaptation for Progressive Model Compression with Knowledge Distillation

Injoon Hwang, Haewon Park, Youngwan Lee, Jooyoung Yang, SunJae Maeng

TL;DR

This work tackles the challenge of deploying large pre-trained transformers by combining parameter-efficient fine-tuning with aggressive model compression. It introduces PC-LoRA, which attaches low-rank adapters to linear layers and progressively reduces the influence of the pre-trained weights through a decay factor, so that only the adapters remain at inference. The training objective blends a downstream task loss with a feature-based knowledge distillation term that regularizes learning, and the decay schedule $\lambda(n)$ governs the transition from the base weights to the adapters. Empirically, PC-LoRA reduces parameters and FLOPs by about 94% and 89% for vision models and by about 93% and 84% for language models, with modest accuracy degradation. It performs robustly across ViT and BERT variants, and the adapter rank controls the compression ratio. The approach is practical for deploying compact, fine-tuned models in resource-constrained settings and is compatible with other compression techniques such as quantization.
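The mechanism summarized above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation. The function names, the pure-Python matrix helper, and the weighting `alpha` between the two losses are assumptions made for illustration. The source only states that the pre-trained weights and bias are scaled by a decay factor and that the objective blends a task loss with a feature-based distillation term.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pc_lora_linear(x, W, b, A, B, C, lam):
    """Forward pass of one linear layer under progressive compression.

    Assumed form: y = lam * (W x + b) + B (A x) + C, where W and b are the
    frozen pre-trained weight and bias, A (r x d_in), B (d_out x r) and C
    are the trainable adapter parameters, and lam decays from 1 to 0.
    """
    base = [wx + bi for wx, bi in zip(matvec(W, x), b)]
    adapter = [bax + ci for bax, ci in zip(matvec(B, matvec(A, x)), C)]
    return [lam * p + q for p, q in zip(base, adapter)]


def blended_loss(task_loss, kd_loss, alpha=0.5):
    """Convex blend of task and distillation losses (alpha is hypothetical)."""
    return alpha * task_loss + (1.0 - alpha) * kd_loss


# Toy layer: d_in = d_out = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [1.0, 1.0]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
C = [0.5, 0.5]
x = [1.0, 2.0]

y_start = pc_lora_linear(x, W, b, A, B, C, lam=1.0)  # base weights fully active
y_end = pc_lora_linear(x, W, b, A, B, C, lam=0.0)    # adapters only
```

With `lam=0.0` the output depends only on `A`, `B`, and `C`, so `W` and `b` can be dropped from the deployed model.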

Abstract

Low-rank adaptation (LoRA) is a prominent method that adds a small number of learnable parameters to frozen pre-trained weights for parameter-efficient fine-tuning. Prompted by the question, "Can the LoRA weights alone provide a sufficient representation at the final phase of fine-tuning, without the pre-trained weights?", we introduce Progressive Compression LoRA (PC-LoRA), which uses low-rank adaptation to perform model compression and fine-tuning simultaneously. PC-LoRA gradually removes the pre-trained weights during training, eventually leaving only the low-rank adapters. These adapters thus replace the pre-trained weights entirely, achieving compression and fine-tuning at the same time. Empirical analysis across various models demonstrates that PC-LoRA achieves parameter/FLOPs compression rates of 94.36%/89.1% for vision models (e.g., ViT-B) and 93.42%/84.2% for language models (e.g., BERT).
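To see where compression rates of this order come from, the following sketch counts the parameters of a single linear layer before and after it is replaced by a rank-r adapter. It is an illustration under stated assumptions: the helper names are hypothetical, the adapter is assumed to keep one bias vector of size `d_out`, and 768 is the ViT-B hidden size. The whole-model numbers (85.8M and 5.94M at rank 32) are the ones quoted in the caption of Figure 4.

```python
def dense_params(d_in, d_out):
    """Parameters of a standard linear layer: weight matrix plus bias."""
    return d_in * d_out + d_out


def low_rank_params(d_in, d_out, rank):
    """Parameters of the adapter-only layer: A (rank x d_in), B (d_out x rank), bias."""
    return rank * (d_in + d_out) + d_out


def compression_rate(original, compressed):
    """Fraction of the original size that has been removed."""
    return 1.0 - compressed / original


# One 768 x 768 projection replaced by a rank-32 adapter.
layer_rate = compression_rate(dense_params(768, 768), low_rank_params(768, 768, 32))

# Whole-model parameter counts quoted for ViT-B at rank 32.
model_rate = compression_rate(85.8e6, 5.94e6)

print(f"per-layer: {layer_rate:.2%}, whole model: {model_rate:.2%}")
```

Lowering the rank shrinks the `rank * (d_in + d_out)` term, which is how the rank sets the compression ratio.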

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: The overall diagram of the PC-LoRA method. At each training step, the pre-trained weights and bias are scaled by a decay factor $\lambda$ and gradually vanish, so that eventually only the low-rank adapter weights $A$, $B$ and bias $C$ remain.
  • Figure 2: Performance of PC-LoRA with ViT-B (Dosovitskiy et al., 2021) at different compression ratios, compared to the fully fine-tuned ViT-B on CIFAR-10.
  • Figure 3: The three types of Decay Factor Scheduler: Sine, 1-Cosine, and Linear. As iterations progress, the decay factor decreases from 1 to 0, affecting the rate at which the original weight becomes less influential. Initially, a factor of 1 means the pre-trained model's weights are entirely preserved, while a factor of 0 indicates the complete transition to the new weights.
  • Figure 4: Attention map visualization with [CLS] token: Full-finetuned ViT-B (85.8M) vs. PC-LoRA ViT-B w/ rank=32 (5.94M). Even with a much smaller model size, our compressed ViT shows comparable attention map quality compared to the full-finetuned ViT-B.
  • Figure 5: Performance of PC-LoRA with BERT (Devlin et al., 2019) at different compression ratios, compared to the fully fine-tuned BERT-B on IMDb.
  • ...and 5 more figures
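
The three decay factor schedulers named in the caption of Figure 3 can be sketched as follows. The exact formulas are not given in this excerpt, so this is one plausible reading of the names, not the paper's definition: each scheduler is assumed to define a progress curve that rises from 0 to 1, and the decay factor is taken as one minus that progress. The function name and the `end_ratio` parameter (the fraction of training at which the decay finishes) are hypothetical.

```python
import math


def decay_factor(n, total_steps, kind="sine", end_ratio=1.0):
    """Return an assumed lambda(n) in [0, 1] for training step n.

    A value of 1 keeps the pre-trained weights intact and 0 removes them.
    """
    end_step = end_ratio * total_steps
    if end_step <= 0:
        raise ValueError("end_ratio * total_steps must be positive")
    # Normalized progress through the decay phase, clamped to [0, 1].
    t = min(max(n / end_step, 0.0), 1.0)
    if kind == "sine":
        progress = math.sin(0.5 * math.pi * t)        # fast early decay
    elif kind == "1-cosine":
        progress = 1.0 - math.cos(0.5 * math.pi * t)  # slow early decay
    elif kind == "linear":
        progress = t                                  # constant-rate decay
    else:
        raise ValueError(f"unknown scheduler: {kind}")
    return 1.0 - progress


# Decay factor at the halfway point of a 100-step run for each scheduler.
for kind in ("sine", "1-cosine", "linear"):
    print(kind, round(decay_factor(50, 100, kind=kind), 4))
```

Under this reading, the sine variant removes most of the pre-trained contribution early, the 1-cosine variant holds on to it longer, and the linear variant sits between the two.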