tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models

Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai

TL;DR

tLoRA addresses the inefficiency of training multiple heterogeneous LoRA adapters in shared GPU clusters by unifying them under a Shared Super-Model (SSM). It combines a fused, adapter-aware kernel (Kernel Fuser) with an online, residual-capacity–aware Adapter Scheduler and an elastic grouping strategy (Hierarchical Incremental Grouping) to maximize cluster throughput while preserving per-job progress. The approach is compatible with existing distributed stacks and preserves training semantics. On real-world cluster traces, tLoRA improves training throughput by 1.2–1.8×, job training completion time by 2.3–5.4×, and GPU utilization by 37%, which points to its potential for rapid, multi-tenant model fine-tuning at scale.

Abstract

As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that are worse than executing independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2–1.8×, job training completion time by 2.3–5.4×, and GPU utilization by 37%.
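
The idea of co-locating jobs over one frozen backbone can be illustrated with a minimal NumPy sketch. It is not tLoRA's fused kernel and says nothing about its tiling, nano-batching, or scheduling. It only shows the standard LoRA identity y = xWᵀ + s·(xAᵀ)Bᵀ, with the base projection computed once for the concatenated batch and the low-rank update applied per job, so that adapters with different ranks and batch sizes can share the expensive part. The function names, shapes, and the per-job scale `s` are assumptions made for illustration.

```python
import numpy as np


def lora_forward_separate(x, W, A, B, scale):
    """Reference path: one job run on its own.

    x: (n, d_in), W: (d_out, d_in), A: (r, d_in), B: (d_out, r).
    """
    return x @ W.T + scale * (x @ A.T) @ B.T


def lora_forward_shared(xs, W, adapters):
    """Co-located path (illustrative, hypothetical helper).

    xs:       list of per-job inputs, each (n_i, d_in); n_i may differ.
    adapters: list of (A_i, B_i, scale_i); rank r_i may differ per job.
    Returns a list of per-job outputs, each (n_i, d_out).
    """
    # One base-model GEMM over all jobs' rows: the frozen weight W is
    # read once instead of once per job.
    x_all = np.concatenate(xs, axis=0)
    y_all = x_all @ W.T

    outputs = []
    start = 0
    for x, (A, B, scale) in zip(xs, adapters):
        n = x.shape[0]
        # Per-job low-rank update; cost scales with this job's rank only.
        delta = scale * (x @ A.T) @ B.T
        outputs.append(y_all[start:start + n] + delta)
        start += n
    return outputs
```

Because each job's rows only ever meet that job's own `A` and `B`, the shared path returns the same outputs as running every job separately, which is the sense in which co-location can preserve training semantics. A real system would replace the Python loop with a grouped or fused kernel.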

Paper Structure (38 sections, 3 equations, 18 figures, 1 algorithm)

Figures (18)

  • Figure 1: Adapter heterogeneity (e.g., in rank and batch size) creates tension between throughput and per-job latency in multi-LoRA training.
  • Figure 2: Naïve batch LoRA training may hurt aggregate training throughput (Llama3.1-8B).
  • Figure 3: Lifecycle of multi-adapter LoRA training with tLoRA.
  • Figure 4: Execution of a micro-batch using our Kernel Fuser.
  • Figure 5: Training throughput.
  • ...and 13 more figures