tLoRA: Efficient Multi-LoRA Training with Elastic Shared Super-Models
Kevin Li, Dibyadeep Saha, Avni Kanodia, Fan Lai
TL;DR
tLoRA addresses the inefficiency of training multiple heterogeneous LoRA adapters in shared GPU clusters by unifying them under a Shared Super-Model (SSM). It combines a fused, adapter-aware kernel (Kernel Fuser) with an online, residual-capacity–aware Adapter Scheduler and an elastic grouping strategy (Hierarchical Incremental Grouping) to maximize cluster throughput while preserving per-job progress. The approach is compatible with existing distributed training stacks and preserves training semantics. Micro-benchmarks and realistic cluster traces both show substantial gains; on the traces, tLoRA improves training throughput by 1.2–1.8x, speeds up job completion by 2.3–5.4x, and raises GPU utilization by 37%, highlighting its potential to democratize rapid, multi-tenant model fine-tuning at scale.
Abstract
As Low-Rank Adaptation (LoRA) becomes the standard approach for efficiently fine-tuning large language models (LLMs), shared clusters increasingly execute many concurrent LoRA training jobs over the same frozen backbone. While recent advances enable batching (co-locating) multiple adapters during serving, efficient training-time co-location of heterogeneous LoRA adapters presents unique challenges. Jobs often differ in adapter rank, batch size, and resource allocation, and naïve batching can introduce synchronization stalls, communication overheads, and per-job slowdowns that leave jobs slower than if they ran independently. We introduce tLoRA, a framework that enables efficient batch training of multiple LoRA jobs. tLoRA fuses adapters that share the same base model into an elastic shared super-model, exploiting existing distributed training frameworks to derive parallelism plans that share resources effectively. At the kernel level, tLoRA employs a fused LoRA kernel that adaptively reconstructs low-rank computation tiles and schedules rank-aware nano-batches to maximize overlap between computation and communication across adapters. At the scheduling layer, tLoRA incorporates an online, residual-capacity-aware scheduler that adaptively groups jobs to maximize collective throughput. Evaluations using real-world cluster traces demonstrate that tLoRA improves training throughput by 1.2–1.8x, speeds up job completion by 2.3–5.4x, and raises GPU utilization by 37%.
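To make the idea of co-locating adapters over one frozen backbone concrete, here is a minimal pure-Python sketch. It is not tLoRA's fused kernel or its API: the function names (`matmul`, `add`, `lora_forward`, `shared_forward`), the dictionary layout of a job, and the toy shapes are all assumptions made for illustration. It only shows the standard LoRA forward pass, x·W + (x·A)·B, and checks that applying the shared backbone W once to the stacked inputs of several jobs with different ranks and batch sizes gives each job the same output it would get when run alone.

```python
# Illustrative sketch only: names, job layout, and shapes are assumptions,
# not tLoRA's actual kernel or interface.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(x, w, a, b):
    """Single-job LoRA forward pass: x @ W + (x @ A) @ B."""
    return add(matmul(x, w), matmul(matmul(x, a), b))

def shared_forward(jobs, w):
    """Co-located forward pass for several jobs sharing the frozen W."""
    # Stack every job's rows so the backbone matmul runs once for all jobs.
    stacked = [row for job in jobs for row in job["x"]]
    base = matmul(stacked, w)
    outputs, start = [], 0
    for job in jobs:
        n = len(job["x"])
        # The low-rank update stays per job; ranks may differ across jobs.
        delta = matmul(matmul(job["x"], job["a"]), job["b"])
        outputs.append(add(base[start:start + n], delta))
        start += n
    return outputs

# Frozen backbone weight (2x2 identity, for readability).
w = [[1.0, 0.0], [0.0, 1.0]]
jobs = [
    # Hypothetical job with rank 1 and batch size 2.
    {"x": [[1.0, 2.0], [3.0, 4.0]], "a": [[1.0], [0.0]], "b": [[0.5, 0.5]]},
    # Hypothetical job with rank 2 and batch size 1.
    {"x": [[1.0, 1.0]],
     "a": [[1.0, 0.0], [0.0, 1.0]],
     "b": [[0.0, 1.0], [1.0, 0.0]]},
]

shared = shared_forward(jobs, w)
alone = [lora_forward(j["x"], w, j["a"], j["b"]) for j in jobs]
print(shared == alone)
```

The sketch covers only the forward computation on a single device. The challenges the abstract names (synchronization stalls, communication overhead, and grouping jobs by residual capacity) arise in distributed training and are not modeled here.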
