Symbiosis: Multi-Adapter Inference and Fine-Tuning
Saransh Gupta, Umesh Deshpande, Travis Janssen, Swami Sundararaman
TL;DR
Symbiosis introduces a split-execution framework that serves the base model as a shared substrate while isolating client-specific adapters. By decoupling base layers (run by the base executor) from client layers (adapters, attention), it enables cross-client sharing, flexible placement across GPUs and CPUs, and privacy-preserving multi-tenancy. The system supports multiple PEFT methods, opportunistic per-layer batching, long-context inference on heterogeneous compute, and privacy mechanisms for adapter confidentiality. Empirical evaluations across single-GPU and multi-GPU, remote and sharded, and CPU-GPU configurations show memory savings, higher adapter throughput, and end-to-end performance gains for multi-adapter workloads, especially for long contexts and heterogeneous environments.
Abstract
Parameter-efficient fine-tuning (PEFT) allows model builders to capture task-specific parameters in adapters, which are a fraction of the size of the original base model. The popularity of PEFT techniques for fine-tuning has led to the creation of a large number of adapters for popular Large Language Models (LLMs). However, existing frameworks fall short in supporting inference or fine-tuning with multiple adapters in the following ways. 1) For fine-tuning, each job needs to deploy its own dedicated base model instance, which results in excessive GPU memory consumption and poor GPU utilization. 2) While popular inference platforms can serve multiple PEFT adapters, they do not allow independent resource management or mixing of different PEFT methods. 3) They cannot make effective use of heterogeneous accelerators. 4) They do not provide privacy to users who may not wish to expose their fine-tuned parameters to service providers. In Symbiosis, we address the above problems by enabling as-a-service deployment of the base model. The base model layers can be shared across multiple inference or fine-tuning processes. Our split-execution technique decouples the execution of client-specific adapters and layers from the frozen base model layers, offering clients the flexibility to manage their resources, select their fine-tuning method, and achieve their performance goals. Our approach is transparent to models and works out of the box for most models in the transformers library. We demonstrate the use of Symbiosis to simultaneously fine-tune 20 Gemma2-27B adapters on 8 GPUs.
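
The split-execution idea in the abstract can be illustrated with a minimal sketch. It is not the Symbiosis implementation: the names `BaseExecutor`, `LoRAClient`, and `split_forward` are hypothetical, a single frozen linear layer stands in for the base model, LoRA is assumed as the PEFT method, and plain Python lists replace tensors so the example stays dependency-free. The sketch only shows the division of labor: one shared frozen layer serves a batch gathered from several clients, while each client applies its own adapter to its own input.

```python
from dataclasses import dataclass

Vector = list[float]
Matrix = list[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (rows x cols) by vector x (cols)."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


@dataclass
class BaseExecutor:
    """Hypothetical stand-in for the shared, frozen base layer."""
    weight: Matrix
    calls: int = 0  # counts how many batched invocations were served

    def forward_batch(self, inputs: list[Vector]) -> list[Vector]:
        # One invocation serves inputs gathered from several clients.
        self.calls += 1
        return [matvec(self.weight, x) for x in inputs]


@dataclass
class LoRAClient:
    """Client-owned low-rank adapter (assumed LoRA): delta = scale * B(Ax)."""
    a: Matrix  # shape: rank x d_in
    b: Matrix  # shape: d_out x rank
    scale: float = 1.0

    def adapter_delta(self, x: Vector) -> Vector:
        return [self.scale * v for v in matvec(self.b, matvec(self.a, x))]


def split_forward(
    base: BaseExecutor, clients: list[LoRAClient], inputs: list[Vector]
) -> list[Vector]:
    """Run the frozen layer once for all clients, then add each client's delta."""
    base_out = base.forward_batch(inputs)
    return [
        [h + d for h, d in zip(out, client.adapter_delta(x))]
        for client, x, out in zip(clients, inputs, base_out)
    ]
```

In this sketch the adapter weights never pass through `BaseExecutor`; only the inputs to the frozen layer do. That mirrors the separation the abstract describes, though any actual privacy mechanism would need considerably more than this.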
