CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices
Weilin Zhao, Yuxiang Huang, Xu Han, Zhiyuan Liu, Zhengyan Zhang, Kuai Li, Chen Chen, Tao Yang, Maosong Sun
TL;DR
CA-LoRA enables efficient multi-tasking on personal devices by adapting existing LoRA modules trained on an uncompressed LLM to a compressed LLM (CLM). It introduces LoRA Knowledge Inheritance, which initializes the CLM's LoRA modules from those of the LLM, and Knowledge Recovery, which compensates for compression-induced knowledge loss through low-rank, non-linear modules guided by a distillation objective. Empirical results across 11 NLP tasks and multiple compression schemes show that CA-LoRA consistently surpasses vanilla LoRA on CLMs and attains performance close to that of the uncompressed LLM with LoRA, while keeping the set of trainable parameters small. The framework makes on-device deployment more practical by reducing storage overhead and supporting multi-tasking with little loss in performance.
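The two mechanisms named above can be illustrated with a toy linear layer. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class name, the rounding-based "compression", the tanh-approximated GELU non-linearity, the zero initialization of the recovery up-projection, and the mean-squared-error distillation loss are all hypothetical choices made for illustration.

```python
import math
import random


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def add(*vecs):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vecs)]


def gelu(v):
    """Tanh-approximated GELU (assumed choice of non-linearity)."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * t * (1.0 + math.tanh(c * (t + 0.044715 * t ** 3))) for t in v]


class CompressedLinearWithLoRA:
    """Hypothetical layer: frozen compressed weight + inherited LoRA + recovery."""

    def __init__(self, W_compressed, lora_A, lora_B, recovery_rank, rng):
        d_out, d_in = len(W_compressed), len(W_compressed[0])
        self.W = W_compressed  # frozen compressed weight
        # Knowledge inheritance (assumed form): start from the LoRA
        # factors that were trained on the uncompressed model.
        self.A = [row[:] for row in lora_A]  # shape (r, d_in)
        self.B = [row[:] for row in lora_B]  # shape (d_out, r)
        # Knowledge recovery (assumed form): low-rank bottleneck with a
        # non-linearity. The up-projection starts at zero, so the module
        # contributes nothing before training.
        self.R_down = [[rng.gauss(0.0, 0.02) for _ in range(d_in)]
                       for _ in range(recovery_rank)]
        self.R_up = [[0.0] * recovery_rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        lora = matvec(self.B, matvec(self.A, x))
        recovery = matvec(self.R_up, gelu(matvec(self.R_down, x)))
        return add(base, lora, recovery)


def distill_loss(student_out, teacher_out):
    """Mean squared error between student and teacher outputs (assumed objective)."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)


# Toy data: "compress" a 2x2 weight by rounding to one decimal place.
W_full = [[0.31, -0.12], [0.07, 0.58]]
W_comp = [[round(w, 1) for w in row] for row in W_full]
A = [[1.0, 0.0]]      # rank-1 LoRA down-projection
B = [[0.5], [-0.5]]   # rank-1 LoRA up-projection
x = [1.0, 2.0]

student = CompressedLinearWithLoRA(W_comp, A, B, recovery_rank=1, rng=random.Random(0))
student_out = student.forward(x)
teacher_out = add(matvec(W_full, x), matvec(B, matvec(A, x)))
loss = distill_loss(student_out, teacher_out)
print(student_out, teacher_out, loss)
```

In this sketch the loss is non-zero before any training because the compressed weight differs from the original. Training `R_down`, `R_up`, and the inherited LoRA factors against the teacher output is what would reduce that gap.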
Abstract
Recently, there has been a demand to deploy Large Language Models (LLMs) on personal devices such as laptops and smartphones. An LLM typically needs a separate model variant for each task it handles. However, personal devices have limited resources and require reduced storage overhead. Two key methods address this: the first is model compression, which compresses LLMs into smaller sizes; the second is LoRA, which can adapt an LLM to other tasks with very few parameters, avoiding the storage of multiple model variants in multi-task scenarios by preserving only the LoRAs. However, our experiments show that directly combining these two methods yields sub-optimal performance. Considering that the open-source community has already contributed many LoRAs for LLMs, we propose to adapt these existing LoRAs from the LLMs to their compressed versions and introduce a Compression-Aware LoRA (CA-LoRA) framework. We incorporate knowledge inheritance and recovery strategies to recover the knowledge lost through model compression. Experimental results demonstrate that CA-LoRA outperforms vanilla LoRA applied to a compressed LLM and achieves performance comparable to the non-compressed LLM with existing LoRA modules. The source code of CA-LoRA is available at https://github.com/thunlp/CA-LoRA.
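The storage argument for LoRA can be made concrete with a short calculation. The sketch below relies on the standard LoRA parameter count for a single weight matrix, r(d_in + d_out) for rank r. The layer size, rank, and number of tasks are hypothetical values chosen for illustration and are not taken from the paper's experiments.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)


def storage_comparison(d_in, d_out, rank, num_tasks):
    """Compare parameter counts for one weight matrix across several tasks."""
    full = d_in * d_out
    lora = lora_param_count(d_in, d_out, rank)
    return {
        # One fully fine-tuned copy of the matrix per task.
        "separate_variants": num_tasks * full,
        # One shared matrix plus one LoRA pair per task.
        "shared_plus_lora": full + num_tasks * lora,
    }


# Hypothetical setting: a 4096 x 4096 projection, rank 8, 10 tasks.
result = storage_comparison(d_in=4096, d_out=4096, rank=8, num_tasks=10)
print(result)
```

Compressing the shared matrix shrinks the `full` term further, which is why the two methods are natural to combine on storage-limited devices.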
