Batched Low-Rank Adaptation of Foundation Models
Yeming Wen, Swarat Chaudhuri
TL;DR
LoRA enables parameter-efficient fine-tuning but becomes a bottleneck in real-time serving when each request requires a different adapter. The authors propose Fast LoRA (FLoRA), which assigns per-example adapters within a minibatch and expresses the forward pass as $Y = \phi(A \circ ((B \circ X) W_0))$, where $\circ$ denotes element-wise multiplication, retaining expressive power while enabling vectorized batching. Empirically, FLoRA improves throughput and latency by up to about 2× at low ranks while maintaining accuracy on multilingual code generation across 8 languages and multilingual speech recognition across 6 languages, using StarCoder 15B/3B/1B and Whisper 1.5B with 8-bit quantization. This provides a practical path for deploying personalized, task-specific adaptations in production with existing batch-serving pipelines, without retraining full foundation models.
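The forward expression in the TL;DR can be illustrated with a minimal NumPy sketch. It assumes the rank-1 case, in which example $i$ has its own vectors $a_i$ and $b_i$ and its adapted weight is the element-wise product $W_0 \circ (b_i a_i^\top)$; that reading, the function names, the shapes, and the choice of `np.tanh` for $\phi$ are assumptions made for illustration, not details taken from the paper's implementation. The sketch checks that the batched form gives the same output as building each example's adapted weight explicitly.

```python
import numpy as np

def flora_forward(X, W0, A, B, phi=np.tanh):
    """Batched form: Y = phi(A * ((B * X) @ W0)).

    X:  (batch, d_in)   inputs
    W0: (d_in, d_out)   shared frozen weight
    B:  (batch, d_in)   per-example input-side vectors (assumed rank 1)
    A:  (batch, d_out)  per-example output-side vectors (assumed rank 1)
    """
    return phi(A * ((B * X) @ W0))

def per_example_forward(X, W0, A, B, phi=np.tanh):
    """Reference: materialize each example's adapted weight, one at a time."""
    outputs = []
    for x, a, b in zip(X, A, B):
        W_i = W0 * np.outer(b, a)      # assumed adapted weight, (d_in, d_out)
        outputs.append(phi(x @ W_i))
    return np.stack(outputs)

rng = np.random.default_rng(0)
batch, d_in, d_out = 4, 8, 5
X = rng.normal(size=(batch, d_in))
W0 = rng.normal(size=(d_in, d_out))
B = rng.normal(size=(batch, d_in))
A = rng.normal(size=(batch, d_out))

Y_fast = flora_forward(X, W0, A, B)
Y_ref = per_example_forward(X, W0, A, B)
print(np.allclose(Y_fast, Y_ref))
```

The batched version performs one shared matrix multiplication with $W_0$ plus two element-wise scalings, so it never materializes a separate weight matrix per example, which is what makes heterogeneous requests batchable in this sketch.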
Abstract
Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its inability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its own low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showing competitive results on the MultiPL-E code generation benchmark spanning 8 languages and a multilingual speech recognition task across 6 languages.
