CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices

Weilin Zhao; Yuxiang Huang; Xu Han; Zhiyuan Liu; Zhengyan Zhang; Kuai Li; Chen Chen; Tao Yang; Maosong Sun

TL;DR

CA-LoRA enables efficient multi-tasking on personal devices by adapting existing LoRA modules trained on an uncompressed LLM to a compressed LLM (CLM). It introduces LoRA Knowledge Inheritance to initialize CLM LoRA modules from the LLM and Knowledge Recovery to compensate for compression-induced knowledge loss via low-rank, non-linear modules guided by a distillation objective. Empirical results across 11 NLP tasks and multiple compression schemes show that CA-LoRA consistently surpasses vanilla LoRA on CLMs and attains performance close to the uncompressed LLM with LoRA, while maintaining a small set of trainable parameters. This framework advances practical on-device AI by reducing storage overhead and enabling scalable multi-tasking without sacrificing performance.
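The two components named above can be illustrated with a minimal NumPy sketch of a single linear layer. This is not the authors' implementation. The layer sizes, the rounding used as a stand-in for compression, the ReLU bypass form `U @ relu(V @ x)`, the zero initialization of `U`, and the mean-squared-error distillation term are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, rec_rank = 16, 16, 4, 2  # hypothetical sizes

# Stand-in for compression (assumption): round weights to a coarse grid.
W = rng.standard_normal((d_out, d_in))
W_compressed = np.round(W * 4) / 4

# "Existing" LoRA factors for the uncompressed LLM (random placeholders here).
A_llm = rng.standard_normal((rank, d_in)) * 0.1
B_llm = rng.standard_normal((d_out, rank)) * 0.1

# LoRA Knowledge Inheritance: the CLM's LoRA starts from the LLM's LoRA.
A, B = A_llm.copy(), B_llm.copy()

# Knowledge Recovery: a low-rank, non-linear bypass (form assumed).
V = rng.standard_normal((rec_rank, d_in)) * 0.1
U = np.zeros((d_out, rec_rank))  # zero init, so the bypass starts as a no-op


def teacher_forward(x):
    """Uncompressed weight plus its existing LoRA."""
    return W @ x + B_llm @ (A_llm @ x)


def student_forward(x):
    """Compressed weight plus inherited LoRA plus recovery bypass."""
    base = W_compressed @ x
    lora = B @ (A @ x)
    recovery = U @ np.maximum(V @ x, 0.0)
    return base + lora + recovery


def distillation_loss(x):
    """Mean squared gap between student and teacher outputs (assumed form)."""
    diff = student_forward(x) - teacher_forward(x)
    return float(np.mean(diff ** 2))


x = rng.standard_normal((d_in, 8))  # a batch of 8 column vectors
print(distillation_loss(x))
```

In this sketch, the gap between student and teacher before any training comes entirely from the compression error `(W_compressed - W) @ x`. Training would update `A`, `B`, `U`, and `V` to reduce that gap while `W_compressed` stays frozen.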

Abstract

There is growing demand to deploy Large Language Models (LLMs) on personal devices such as laptops and smartphones. Handling different tasks typically requires different variants of an LLM, but personal devices have limited resources and need low storage overhead. Two key methods address this: model compression, which shrinks an LLM to a smaller size, and LoRA, which adapts an LLM to a new task with very few parameters, so that multi-task scenarios require storing only the LoRAs rather than multiple model variants. However, our experiments show that directly combining these two methods yields sub-optimal performance. Considering that the open-source community has already contributed many LoRAs for LLMs, we propose to adapt these existing LoRAs from the LLMs to their compressed versions and introduce a Compression-Aware LoRA (CA-LoRA) framework. We incorporate knowledge inheritance and recovery strategies to recover the knowledge lost through model compression. Experimental results demonstrate that CA-LoRA outperforms vanilla LoRA applied to a compressed LLM and achieves performance comparable to the non-compressed LLM with existing LoRA modules. The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
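The storage argument for LoRA can be made concrete with a back-of-the-envelope parameter count. The hidden size and rank below are hypothetical values picked for easy arithmetic, not figures from the paper.

```python
def full_matrix_params(d_out: int, d_in: int) -> int:
    """Parameters in one dense weight matrix."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a LoRA pair: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


d = 4096   # hypothetical hidden size
rank = 8   # hypothetical LoRA rank

full = full_matrix_params(d, d)
lora = lora_params(d, d, rank)
ratio = full // lora

print(f"dense: {full:,}  lora: {lora:,}  ratio: {ratio}x")
```

Under these assumed sizes, each additional task costs one LoRA pair per adapted matrix instead of a full copy of that matrix, which is why keeping only LoRAs is attractive for multi-task use.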

Paper Structure (23 sections, 6 equations, 5 figures, 7 tables)

Introduction
Related Work
Parameter-Efficient Fine-Tuning
Model Compression
Multi-tasking
Methodology
Preliminary
Framework
LoRA Knowledge Inheritance
Model Knowledge Recovery
Experiments and Analyses
The Performance on Typical NLP Tasks
The Performance on Instruction Tuning
General Instruction Tuning
Task-specific Instruction Tuning
...and 8 more sections

Figures (5)

Figure 1: The training and deployment pattern of previous methods and our CA-LoRA.
Figure 2: The overall design of our CA-LoRA.
Figure 3: For typical NLP tasks, the results of LoRA and CA-LoRA on different CLMs (%).
Figure 4: The performance of LoRA, QLoRA and CA-LoRA on HumanEval by instruction-tuning on CodeAlpaca-20k.
Figure 5: The convergence of vanilla LoRA, inherited LoRA (CA-LoRA without recovery modules), and CA-LoRA.
