FLoRA: Fused forward-backward adapters for parameter efficient fine-tuning and reducing inference-time latencies of LLMs
Dhananjaya Gowda, Seoha Song, Junhyun Lee, Harshith Goka
TL;DR
This work addresses the latency and accuracy challenges of parameter-efficient fine-tuning (PEFT) for large language models by introducing FLoRA, a family of fused forward-backward adapters (FFBA). FFBA fuses the forward and backward adapter computations into base-layer projections to reduce inference-time overhead while preserving or improving task accuracy on commonsense reasoning, arithmetic reasoning, and summary/dialogue generation. Empirical results on Llama3.2 1B and 3B-inst models show that FFBA variants surpass LoRA in accuracy on many tasks and achieve substantial latency reductions relative to LoRA (up to ~48% improvement in time per output token, TPOT, for 3B models), approaching the performance of full fine-tuning on several benchmarks. The approach enables more efficient on-device PEFT with improved parallelization, offering a practical path to deploying capable, fine-tuned LLMs with limited compute budgets.
Abstract
As large language models (LLMs) continue to grow in size, efficient training and fine-tuning have become more important than ever. This has driven great interest in parameter-efficient fine-tuning (PEFT), and effective methods such as low-rank adapters (LoRA) have emerged. Although PEFT methods have been studied extensively in recent years, much of the design space remains unexplored because of its many degrees of freedom. In this paper, we propose FLoRA, a family of fused forward-backward adapters (FFBA) for parameter-efficient fine-tuning of LLMs on downstream tasks. FFBA combines ideas from the popular LoRA and parallel adapters to improve overall fine-tuning accuracy. At the same time, latency is minimized by fusing the forward and backward adapters into existing projection layers of the base model. Experimental results show that the proposed FFB adapters perform significantly better than the widely used LoRA in both accuracy and latency for a similar parameter budget.
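To make the idea of fusing an adapter into an existing projection concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "fusing" a down-projection means stacking the adapter matrix under the base weight so that a single matrix multiply yields both outputs. The names `W`, `A`, `separate`, `fused`, and `demo` are hypothetical and chosen only for illustration.

```python
import random

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def separate(W, A, x):
    """Unfused path: one matmul for the base projection W,
    and a second matmul for the low-rank adapter down-projection A."""
    return matvec(W, x), matvec(A, x)

def fused(W, A, x):
    """Fused path (assumed form): stack A under W, run one matmul,
    then split the result back into the two outputs."""
    stacked = W + A                      # shape (d_out + r) x d_in
    out = matvec(stacked, x)
    return out[:len(W)], out[len(W):]

def demo(d_in=8, d_out=6, r=2, seed=0):
    """Build random weights and compare the two paths on one input."""
    rng = random.Random(seed)
    W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    A = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(r)]
    x = [rng.uniform(-1, 1) for _ in range(d_in)]
    return separate(W, A, x), fused(W, A, x)

if __name__ == "__main__":
    (y_sep, h_sep), (y_fus, h_fus) = demo()
    print("base outputs match:", y_sep == y_fus)
    print("adapter outputs match:", h_sep == h_fus)
```

The two paths compute the same values. The potential latency benefit comes from launching one larger matrix multiply instead of two smaller ones, which is usually cheaper on accelerators when the adapter rank `r` is small relative to `d_out`.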
