BoRA: Bayesian Hierarchical Low-Rank Adaption for Multi-Task Large Language Models

Simen Eide, Arnoldo Frigessi

TL;DR

BoRA addresses the trade-off in multi-task LLM fine-tuning by introducing a Bayesian hierarchical prior over per-task LoRA adapters, enabling information sharing across related tasks while preserving task-specific specialization. The method optimizes a MAP objective with AdamW, balancing likelihood and prior through a precision parameter $\tau$ and learning-rate adjustments. Empirical results on the Talk of Norway dataset show BoRA achieving the best perplexity at $\tau=100$ and outperforming both independent-task and single-model baselines, with larger gains for data-scarce tasks. This approach generalizes LoRA to a multi-task setting, offering a scalable, data-efficient solution with practical impact for diverse applications, and the authors provide code for replication.

Abstract

This paper introduces Bayesian Hierarchical Low-Rank Adaption (BoRA), a novel method for finetuning multi-task Large Language Models (LLMs). Current finetuning approaches, such as Low-Rank Adaption (LoRA), perform exceptionally well in reducing training parameters and memory usage but face limitations when applied to multiple similar tasks. Practitioners usually have to choose between training separate models for each task or a single model for all tasks, both of which come with trade-offs in specialization and data utilization. BoRA addresses these trade-offs by leveraging a Bayesian hierarchical model that allows tasks to share information through global hierarchical priors. This enables tasks with limited data to benefit from the overall structure derived from related tasks while allowing tasks with more data to specialize. Our experimental results show that BoRA outperforms both individual and unified model approaches, achieving lower perplexity and better generalization across tasks. This method provides a scalable and efficient solution for multi-task LLM finetuning, with significant practical implications for diverse applications.
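
To make the role of the precision parameter $\tau$ concrete, the following is a minimal sketch of a MAP-style objective, not the authors' implementation. It assumes an isotropic Gaussian prior that pulls each task's adapter parameters toward a shared global vector, which is consistent with the limits described for $\tau = 0$ (independent tasks) and $\tau \rightarrow \infty$ (identical task parameters) but is not spelled out in this section. The names `hierarchical_penalty` and `map_loss`, the flat-list parameter layout, and the toy numbers are all hypothetical.

```python
from typing import Dict, List


def hierarchical_penalty(
    task_params: Dict[str, List[float]],
    global_params: List[float],
    tau: float,
) -> float:
    """Negative log of an ASSUMED Gaussian prior, up to an additive constant.

    Computes (tau / 2) * sum over tasks of ||theta_task - theta_global||^2.
    """
    total_sq_dist = 0.0
    for theta in task_params.values():
        # Squared L2 distance between this task's adapter and the global one.
        total_sq_dist += sum((a - g) ** 2 for a, g in zip(theta, global_params))
    return 0.5 * tau * total_sq_dist


def map_loss(
    task_nlls: Dict[str, float],
    task_params: Dict[str, List[float]],
    global_params: List[float],
    tau: float,
) -> float:
    """Sum of per-task negative log-likelihoods plus the prior penalty."""
    return sum(task_nlls.values()) + hierarchical_penalty(
        task_params, global_params, tau
    )


# Toy values, chosen only for illustration.
task_params = {"task_a": [1.0, 2.0], "task_b": [0.0, 0.0]}
global_params = [0.5, 1.0]
task_nlls = {"task_a": 2.0, "task_b": 3.0}

loss_independent = map_loss(task_nlls, task_params, global_params, tau=0.0)
loss_shared = map_loss(task_nlls, task_params, global_params, tau=100.0)
```

With `tau=0.0` the penalty vanishes and each task is fit on its own likelihood alone; as `tau` grows, any deviation from the global parameters becomes increasingly expensive, so the task adapters are pushed toward a single shared solution.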

Paper Structure (16 sections, 6 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Number of training examples for each task.
  • Figure 2: Plot of Precision vs Perplexity. The thick black line is the overall test perplexity across all tasks, and the thinner grey lines represent the test perplexity for each individual task. The leftmost point corresponds to training each task independently ($\tau = 0$), and the rightmost point corresponds to the limiting case when all task-specific model parameters are constrained to be equal ($\tau \rightarrow \infty$).
  • Figure 3: Task dataset size versus the relative improvement in perplexity of the best-performing hierarchical model ($\tau = 100$). Blue shows the improvement over training all models independently ($\tau = 0$); red shows the improvement over the limiting case in which all task-specific model parameters are constrained to be equal ($\tau = 10000$). The figure shows that all tasks benefit from sharing parameters with the global model, and indicates that tasks with less data benefit more than those with more data.
  • Figure 4: L2 distance between each task's adapter weights and the global prior (y-axis) versus the number of training examples (x-axis). As expected, tasks with more training data have a larger distance to the global prior.
  • Figure 5: Training size vs. perplexity of the best-performing hierarchical model ($\tau = 100$). The figure shows that there is no strong relationship between dataset size and final perplexity.