Batched Low-Rank Adaptation of Foundation Models

Yeming Wen, Swarat Chaudhuri

TL;DR

LoRA enables parameter-efficient fine-tuning but bottlenecks real-time serving when each request requires a different adapter. The authors propose fast LoRA (fLoRA), which assigns per-example adapters within a minibatch and expresses the forward as $Y = \phi(A \circ ((B \circ X) W_0))$, keeping expressive power while enabling vectorized batching. Empirically, fLoRA achieves up to about 2× throughput and latency reduction at low ranks, while maintaining accuracy on multilingual code generation across 8 languages and multilingual speech recognition across 6 languages, using StarCoder 15B/3B/1B and Whisper 1.5B with 8-bit quantization. This work provides a practical path for deploying personalized, task-specific adaptations in production with existing batch-serving pipelines, without retraining full foundation models.
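To make the batched forward concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes that $\circ$ is element-wise multiplication, that $X$ holds one input row per request, and that each request carries rank-$r$ factors whose contributions are summed over the rank axis. The activation $\phi$ is left out, and the function names and tensor shapes are illustrative choices. A per-example loop that materializes each adapted weight serves as a reference.

```python
import numpy as np

def flora_forward(X, W0, B, A):
    """Batched forward with one adapter per example (illustrative sketch).

    X:  (batch, d_in)      one input row per request
    W0: (d_in, d_out)      frozen shared weight
    B:  (batch, r, d_in)   per-example input scaling factors (assumed layout)
    A:  (batch, r, d_out)  per-example output scaling factors (assumed layout)
    """
    # Scale each input by its own factors: (batch, r, d_in)
    scaled = B * X[:, None, :]
    # A single broadcast matmul against the shared weight: (batch, r, d_out)
    projected = scaled @ W0
    # Scale the outputs and sum over the rank axis: (batch, d_out)
    return (A * projected).sum(axis=1)

def reference_forward(X, W0, B, A):
    """Per-example loop that builds each adapted weight explicitly."""
    outputs = []
    for x, b, a in zip(X, B, A):
        # W_i = W0 * sum_k outer(b_k, a_k), with * taken element-wise
        W_i = W0 * np.einsum("ki,ko->io", b, a)
        outputs.append(x @ W_i)
    return np.stack(outputs)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
W0 = rng.normal(size=(6, 5))
B = rng.normal(size=(4, 2, 6))
A = rng.normal(size=(4, 2, 5))
print(np.allclose(flora_forward(X, W0, B, A), reference_forward(X, W0, B, A)))
```

In this sketch the vectorized path never builds a per-request weight matrix. The only per-request work is element-wise scaling before and after one shared matrix product.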

Abstract

Low-Rank Adaptation (LoRA) has recently gained attention for fine-tuning foundation models by incorporating trainable low-rank matrices, thereby reducing the number of trainable parameters. While LoRA offers numerous advantages, its applicability for real-time serving to a diverse and global user base is constrained by its incapability to handle multiple task-specific adapters efficiently. This imposes a performance bottleneck in scenarios requiring personalized, task-specific adaptations for each incoming request. To mitigate this constraint, we introduce Fast LoRA (FLoRA), a framework in which each input example in a minibatch can be associated with its unique low-rank adaptation weights, allowing for efficient batching of heterogeneous requests. We empirically demonstrate that FLoRA retains the performance merits of LoRA, showcasing competitive results on the MultiPL-E code generation benchmark spanning over 8 languages and a multilingual speech recognition task across 6 languages.

Paper Structure (18 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: A practical scenario in which a foundation model in production receives four incoming requests, each requiring a distinct adapter. Two adapters are omitted in steps 2 and 3 to simplify the presentation. fLoRA enables batching in such serving scenarios, provided the adapters are of low rank, thereby sustaining high throughput and low latency. Vectorization is discussed in detail in the section on computational efficiency.
  • Figure 2: Left: Generation throughput vs. rank for fLoRA and a torch.bmm implementation of LoRA, measured in tokens per second (token/s). The experiments were conducted on three StarCoder models: StarCoder 15B, StarCoderBase 3B, and StarCoderBase 1B. fLoRA has a large throughput advantage over LoRA when the rank is small. As the rank increases, the torch.bmm implementation of LoRA eventually achieves better throughput. Right: Latency vs. rank on StarCoder-15B. Requests arrive at a rate of 8 per second.
  • Figure 3: Left: Generation throughput vs. rank for fLoRA and a torch.bmm implementation of LoRA, measured in tokens per second (token/s). The experiments were conducted on two Llama-2 models, 13B and 7B (Touvron et al., 2023). fLoRA has a large throughput advantage over LoRA when the rank is small. As the rank increases, the torch.bmm implementation of LoRA eventually achieves better throughput. Right: Latency vs. rank on StarCoder-3B. Requests arrive at rates of 8 and 15 per second.
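The baseline in Figures 2 and 3 is a torch.bmm implementation of LoRA. As a rough illustration of what such a baseline computes, the sketch below uses NumPy's batched `matmul` as a stand-in for `torch.bmm`. It assumes the standard additive LoRA update, where each request has its own pair of low-rank factors. The names `down` and `up` and all shapes are hypothetical; this is not the benchmark code behind the figures.

```python
import numpy as np

def batched_lora_forward(X, W0, down, up):
    """Additive LoRA with a different adapter per example (illustrative sketch).

    X:    (batch, seq, d_in)  token activations for each request
    W0:   (d_in, d_out)       frozen shared weight
    down: (batch, d_in, r)    per-example down-projection (hypothetical name)
    up:   (batch, r, d_out)   per-example up-projection (hypothetical name)
    """
    # Shared projection through the frozen weight: (batch, seq, d_out)
    base = X @ W0
    # Batched matmuls over the leading axis, as torch.bmm does for 3-D tensors
    delta = np.matmul(np.matmul(X, down), up)
    return base + delta

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3, 6))
W0 = rng.normal(size=(6, 5))
down = rng.normal(size=(4, 6, 2))
up = rng.normal(size=(4, 2, 5))

# Reference: build each adapted weight W0 + down_i @ up_i and apply it per request
expected = np.stack([x @ (W0 + d @ u) for x, d, u in zip(X, down, up)])
print(np.allclose(batched_lora_forward(X, W0, down, up), expected))
```

Here the adapter adds two extra batched products per layer on top of the shared projection, whereas the multiplicative form in the TL;DR folds the adapter into element-wise scaling around a single shared product.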