Activated LoRA: Fine-tuned LLMs for Intrinsics
Kristjan Greenewald, Luis Lastras, Thomas Parnell, Vraj Shah, Lucian Popa, Giulio Zizzo, Chulaka Gunasekara, Ambrish Rawat, David Cox
TL;DR
This work introduces Activated LoRA (aLoRA), a modular extension of LoRA that applies adapted weights only to tokens that come after the adapter is invoked, so the base model's KV cache for the prior context can be reused. This design allows seamless, low-latency switching between specialized intrinsics within multiturn pipelines while preserving the base model's performance on non-intrinsic segments. Empirically, aLoRA achieves substantial inference-time speedups (up to ~35x in targeted setups, with meaningful end-to-end gains) while maintaining accuracy on benchmark SFT tasks and on real-world intrinsic tasks such as uncertainty quantification, answerability, and query rewriting. The results demonstrate that aLoRA enables efficient, scalable deployment of task-specific capabilities in complex LLM workflows, with broad potential for integration into RAG and agentic systems. The implementation has been contributed to Hugging Face PEFT.
Abstract
Low-Rank Adaptation (LoRA) has emerged as a highly efficient framework for finetuning the weights of large foundation models, and has become the go-to method for data-driven customization of LLMs. Despite the promise of highly customized behaviors and capabilities, switching between relevant LoRAs in a multiturn setting is inefficient, as the key-value (KV) cache of the entire turn history must be recomputed with the LoRA weights before generation can begin. To address this problem, we propose Activated LoRA (aLoRA), an adapter architecture that modifies the LoRA framework to adapt weights only for the tokens in the sequence after the aLoRA is invoked. Crucially, this change allows aLoRA to accept the base model's KV cache of the input string, meaning that aLoRA can be activated instantly whenever needed in a chain without recomputing the prior keys and values. This enables building what we call intrinsics, i.e., specialized models invoked to perform well-defined operations on portions of an input chain or conversation that otherwise uses the base model by default. We train a set of aLoRA-based intrinsics models, demonstrating competitive accuracy with standard LoRA while significantly improving inference efficiency. We contributed our Activated LoRA implementation to the Hugging Face PEFT library: https://github.com/huggingface/peft.
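To make the KV-cache argument concrete, the following is a minimal sketch of a single key projection under three regimes: base model, standard LoRA, and an aLoRA-style adapter invoked partway through the sequence. It is an illustration only and not the PEFT implementation. The names `project`, `down`, `up`, and `invoke_at` are hypothetical, the matrices are random toy values, and a real transformer applies the same idea in every adapted attention layer.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def rand_matrix(rows, cols):
    """Toy random matrix with entries in [-1, 1]."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def project(x, w, down=None, up=None, scale=1.0, start=None):
    """Per-token projection x @ w, plus a low-rank update on selected tokens.

    start=None  -> base model, no adapter
    start=0     -> standard LoRA, every token is adapted
    start=k > 0 -> aLoRA-style, only tokens at positions >= k are adapted
    """
    out = matmul(x, w)
    if start is None:
        return out
    delta = matmul(matmul(x, down), up)  # low-rank update, one row per token
    for t in range(start, len(x)):
        out[t] = [o + scale * d for o, d in zip(out[t], delta[t])]
    return out

random.seed(0)
seq_len, d_model, rank = 6, 4, 2
invoke_at = 4  # assumed position at which the adapter is invoked

x = rand_matrix(seq_len, d_model)      # token hidden states
w_k = rand_matrix(d_model, d_model)    # base key projection
down = rand_matrix(d_model, rank)      # low-rank factor (d_model x rank)
up = rand_matrix(rank, d_model)        # low-rank factor (rank x d_model)

base_keys = project(x, w_k)
lora_keys = project(x, w_k, down, up, start=0)
alora_keys = project(x, w_k, down, up, start=invoke_at)

# Keys for tokens before the invocation point match the base model under
# the aLoRA-style rule, so a base-model cache for that prefix stays valid.
prefix_reusable_alora = alora_keys[:invoke_at] == base_keys[:invoke_at]
# Under standard LoRA the prefix keys change, so the cache must be rebuilt.
prefix_reusable_lora = lora_keys[:invoke_at] == base_keys[:invoke_at]
# After the invocation point, both adapters apply the same update here.
suffix_adapted = alora_keys[invoke_at:] == lora_keys[invoke_at:]

print(prefix_reusable_alora, prefix_reusable_lora, suffix_adapted)
```

In this single-projection toy, the prefix keys are unchanged only under the aLoRA-style rule. With causal attention the same reasoning carries through deeper layers, since hidden states at prefix positions depend only on prefix tokens and unadapted weights.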
