LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report

Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi

TL;DR

This work systematically evaluates LoRA-based fine-tuning across 10 base models and 31 tasks (310 models) to quantify performance gains and deployment viability. By standardizing training and using LoRAX for multi-model serving, the study demonstrates that 4-bit LoRA can significantly outperform base models and often rival GPT-4 on narrow tasks, while large, instruction-tuned bases and careful base-model selection drive the best gains. The LoRA Land deployment showcases practical, cost-efficient hosting of many specialized LLMs on a single GPU, with LoRAX enabling dynamic adapter loading and scalable throughput. Overall, the results support the practicality of deploying multiple task-specialized LoRAs over a single general LLM, aided by predictive insights from task-complexity heuristics.

Abstract

Low Rank Adaptation (LoRA) has emerged as one of the most widely adopted methods for Parameter Efficient Fine-Tuning (PEFT) of Large Language Models (LLMs). LoRA reduces the number of trainable parameters and memory usage while achieving comparable performance to full fine-tuning. We aim to assess the viability of training and serving LLMs fine-tuned with LoRA in real-world applications. First, we measure the quality of LLMs fine-tuned with quantized low rank adapters across 10 base models and 31 tasks for a total of 310 models. We find that 4-bit LoRA fine-tuned models outperform base models by 34 points and GPT-4 by 10 points on average. Second, we investigate the most effective base models for fine-tuning and assess the correlative and predictive capacities of task complexity heuristics in forecasting the outcomes of fine-tuning. Finally, we evaluate the latency and concurrency capabilities of LoRAX, an open-source Multi-LoRA inference server that facilitates the deployment of multiple LoRA fine-tuned models on a single GPU using shared base model weights and dynamic adapter loading. LoRAX powers LoRA Land, a web application that hosts 25 LoRA fine-tuned Mistral-7B LLMs on a single NVIDIA A100 GPU with 80GB memory. LoRA Land highlights the quality and cost-effectiveness of employing multiple specialized LLMs over a single, general-purpose LLM.
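
The two mechanisms the abstract relies on, a small trainable low-rank update and one base model shared by many adapters, can be illustrated with a minimal sketch. This is a toy in plain Python, not the authors' training code and not the LoRAX API. The class `SharedBaseLayer`, its methods, the `alpha / rank` scaling convention, and the layer sizes used below are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Matrix = List[List[float]]
Vector = List[float]


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Count parameters in the low-rank pair B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class SharedBaseLayer:
    """Hypothetical toy layer: one frozen base weight, many named adapters."""

    def __init__(self, weight: Matrix) -> None:
        self.weight = weight  # frozen base weight, shared by every adapter
        self.adapters: Dict[str, Tuple[Matrix, Matrix, float]] = {}

    def load_adapter(self, name: str, a: Matrix, b: Matrix, alpha: float) -> None:
        """Register an adapter at runtime; the base weight is left untouched."""
        rank = len(a)  # A has shape (rank x d_in)
        self.adapters[name] = (a, b, alpha / rank)

    def forward(self, x: Vector, adapter: Optional[str] = None) -> Vector:
        """Compute W x, plus scale * B (A x) when an adapter is selected."""
        y = matvec(self.weight, x)
        if adapter is None:
            return y
        a, b, scale = self.adapters[adapter]
        delta = matvec(b, matvec(a, x))
        return [yi + scale * di for yi, di in zip(y, delta)]


# Assumed sizes: a 4096 x 4096 projection adapted with rank 8.
full = 4096 * 4096
low_rank = lora_trainable_params(4096, 4096, 8)
print(f"trainable fraction: {low_rank / full:.4%}")

# Two adapters served from the same base weight, selected per request.
layer = SharedBaseLayer([[1.0, 0.0], [0.0, 1.0]])
layer.load_adapter("task_a", a=[[1.0, 1.0]], b=[[1.0], [0.0]], alpha=1.0)
layer.load_adapter("task_b", a=[[1.0, 0.0]], b=[[0.0], [2.0]], alpha=1.0)
print(layer.forward([2.0, 3.0], adapter="task_a"))
print(layer.forward([2.0, 3.0], adapter="task_b"))
```

Because each adapter stores only the two small matrices, adding another task costs a small fraction of the base layer's memory, which is the property that makes hosting many fine-tuned variants on one GPU plausible.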

Paper Structure (30 sections, 9 figures, 14 tables)

Figures (9)

  • Figure 2: Examples of different styles of prompting. To keep prompts identical across the models being compared and to maximize the likelihood of success for all types of models (fine-tuned, auto-complete, or instruction-tuned), all of our prompts adhere to completion style.
  • Figure 3: Example LLM training configuration for LoRA-based fine-tuning. Based on Ludwig (Molino et al., 2019).
  • Figure 4: Performance lift from the best fine-tuned LLM over 1) the best base model (<= 7B) (in blue) and 2) GPT-4 (in red) across 31 tasks, in absolute points.
  • Figure 5: Frequency of base models (with fine-tuning) as the top performer for a task. Ties are excluded, namely for the customer_support task, where most models attain a perfect score of 100%.
  • Figure 6: Comparison of auto-complete vs. instruction-tuned base models, before and after fine-tuning.
  • ...and 4 more figures