CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices

Weilin Zhao, Yuxiang Huang, Xu Han, Zhiyuan Liu, Zhengyan Zhang, Kuai Li, Chen Chen, Tao Yang, Maosong Sun

TL;DR

CA-LoRA enables efficient multi-tasking on personal devices by adapting existing LoRA modules, trained on an uncompressed LLM, to a compressed LLM (CLM). It introduces two components: LoRA Knowledge Inheritance, which initializes the CLM's LoRA modules from those of the LLM, and Knowledge Recovery, which compensates for compression-induced knowledge loss through low-rank, non-linear modules guided by a distillation objective. Empirical results across 11 NLP tasks and multiple compression schemes show that CA-LoRA consistently surpasses vanilla LoRA on CLMs and attains performance close to that of the uncompressed LLM with LoRA, while keeping the set of trainable parameters small. The framework makes on-device deployment more practical by reducing storage overhead and supporting multi-tasking at scale with little loss in performance relative to the uncompressed model.
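The two components named above can be illustrated with a toy single-layer sketch. Everything here is an assumption made for illustration, not the authors' implementation: the class name AdaptedLinear, the tanh non-linearity, the zero initialization of U, the rounding used as a stand-in for compression, and the mean-squared-error distillation term are all hypothetical choices. Plain Python lists stand in for tensors so the sketch runs without extra dependencies.

```python
import math
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def mse(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


class AdaptedLinear:
    """Frozen base weight W plus a LoRA branch (B, A) and an optional
    low-rank, non-linear recovery branch (U, V). Hypothetical layout."""

    def __init__(self, W, A, B, U=None, V=None):
        self.W, self.A, self.B, self.U, self.V = W, A, B, U, V

    def forward(self, x):
        # Base output plus the low-rank LoRA update: W x + B (A x)
        out = add(matvec(self.W, x), matvec(self.B, matvec(self.A, x)))
        if self.U is not None:
            # Recovery branch: U * tanh(V x); tanh is an assumed non-linearity
            hidden = [math.tanh(h) for h in matvec(self.V, x)]
            out = add(out, matvec(self.U, hidden))
        return out


rng = random.Random(0)
d, r = 8, 2  # hidden size and low rank (toy values)

W_full = rand_matrix(d, d, rng)
# Toy stand-in for compression: coarse rounding of the weights
W_comp = [[round(w, 1) for w in row] for row in W_full]

# A LoRA that already exists for the uncompressed model (random here)
A_llm, B_llm = rand_matrix(r, d, rng), rand_matrix(d, r, rng)
teacher = AdaptedLinear(W_full, A_llm, B_llm)

# Knowledge inheritance: start the compressed model's LoRA from a copy of
# the existing LoRA instead of a fresh initialization.
student = AdaptedLinear(
    W_comp,
    [row[:] for row in A_llm],
    [row[:] for row in B_llm],
    U=zeros(d, r),  # assumed zero init, so recovery starts as a no-op
    V=rand_matrix(r, d, rng),
)

# Distillation signal: gap between student and teacher outputs on one input.
# In training this term would be combined with the task loss.
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
distill_loss = mse(student.forward(x), teacher.forward(x))
print(f"distillation gap at initialization: {distill_loss:.6f}")
```

In this sketch the printed gap comes entirely from the difference between W_full and W_comp, since the inherited LoRA matches the teacher's and the recovery branch starts at zero. Training would then update A, B, U, and V to shrink that gap while W_comp stays frozen.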

Abstract

Recently, there has been a demand to deploy Large Language Models (LLMs) on personal devices such as laptops and smartphones. These LLMs come in different model variants for different tasks. However, personal devices have limited resources and require reduced storage overhead. Two key methods address this: the first is model compression, which compresses LLMs into smaller sizes; the second is LoRA, which can transfer an LLM to other tasks with very few parameters, avoiding the storage of multiple model variants in multi-task scenarios by preserving only the LoRAs. However, our experiments show that directly combining these two methods yields sub-optimal performance. Considering that the open-source community has already contributed many LoRAs for LLMs, we propose to adapt these existing LoRAs from the LLMs to their compressed versions and introduce a Compression-Aware LoRA (CA-LoRA) framework. We incorporate knowledge inheritance and recovery strategies to recover the knowledge lost to model compression. Experimental results demonstrate that CA-LoRA outperforms vanilla LoRA applied to a compressed LLM and achieves performance comparable to the non-compressed LLM with existing LoRA modules. The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.

Paper Structure (23 sections, 6 equations, 5 figures, 7 tables)

This paper contains 23 sections, 6 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: The training and deployment pattern of previous methods and our CA-LoRA.
  • Figure 2: The overall design of our CA-LoRA.
  • Figure 3: For typical NLP tasks, the results of LoRA and CA-LoRA on different CLMs (%).
  • Figure 4: The performance of LoRA, QLoRA and CA-LoRA on HumanEval by instruction-tuning on CodeAlpaca-20k.
  • Figure 5: The convergence of vanilla LoRA, inherited LoRA (CA-LoRA without recovery modules), and CA-LoRA.