ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation

Chuiyang Meng, Ming Tang, Vincent W. S. Wong

TL;DR

This paper proposes ZorBA, a zeroth-order optimization-based federated fine-tuning framework with heterogeneous block activation. ZorBA reduces VRAM usage by up to 62.41% compared with three federated fine-tuning baselines while incurring a low communication overhead.

Abstract

Federated fine-tuning of large language models (LLMs) enables collaborative tuning across distributed clients. However, because LLMs are large, local updates in federated learning (FL) may incur substantial video random-access memory (VRAM) usage. Moreover, frequent model exchange may lead to significant communication overhead. To tackle these challenges, we propose ZorBA, a zeroth-order optimization-based federated fine-tuning framework with heterogeneous block activation. ZorBA leverages zeroth-order optimization, which relies only on forward passes, to eliminate the storage of gradients at the clients. ZorBA includes a heterogeneous block activation mechanism in which the central server allocates different subsets of transformer blocks to clients in order to accelerate convergence and reduce VRAM usage. Furthermore, ZorBA uses shared random seeds and the finite differences of gradients to reduce the communication overhead. We conduct a theoretical analysis to characterize the effect of block activation decisions on the convergence rate and VRAM usage. To jointly improve the convergence rate and reduce VRAM usage, we formulate an optimization problem over the block activation decisions and propose an $\varepsilon$-constraint lexicographic algorithm to solve it. Experimental results show that ZorBA reduces VRAM usage by up to 62.41% compared with three federated fine-tuning baselines while incurring a low communication overhead.
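
The role of shared random seeds and finite differences can be illustrated with a minimal sketch. It assumes a standard two-point zeroth-order estimator on a toy quadratic loss; the function names, the perturbation scale `mu`, and the learning rate are hypothetical and are not taken from the paper. Under this assumption, a client only needs to report one scalar per step, because any party that knows the seed can regenerate the perturbation direction.

```python
import numpy as np

def zo_scalar(loss_fn, params, seed, mu=1e-3):
    """Two-point finite difference of the loss along a seeded random direction.

    Only forward evaluations of loss_fn are used; no gradient is stored.
    """
    z = np.random.default_rng(seed).standard_normal(params.shape)
    return (loss_fn(params + mu * z) - loss_fn(params - mu * z)) / (2.0 * mu)

def apply_update(params, scalar, seed, lr):
    """Regenerate the same direction from the seed and take one step."""
    z = np.random.default_rng(seed).standard_normal(params.shape)
    return params - lr * scalar * z

def loss(w):
    # Toy objective (an assumption for illustration): minimum at w = 1.
    return float(np.sum((w - 1.0) ** 2))

w = np.zeros(4)
initial_loss = loss(w)
for step in range(200):
    # The step index doubles as the shared seed in this sketch.
    g = zo_scalar(loss, w, seed=step)
    w = apply_update(w, g, seed=step, lr=0.05)
final_loss = loss(w)
```

In this sketch the pair `(seed, g)` is the only information that would have to be exchanged per step, which is the intuition behind the low communication overhead; the actual ZorBA protocol may differ in its details.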

Paper Structure (14 sections, 35 equations, 4 figures, 2 tables, 3 algorithms)

Figures (4)

  • Figure 1: VRAM usage during zeroth-order optimization with OPT-125M. The forward-pass activations per block include the hidden states, the Q, K, and V projections in transformer blocks, and the feed-forward networks (FFNs).
  • Figure 2: Illustration of our proposed ZorBA framework.
  • Figure 3: An example of different block activation decisions on three clients. We consider that each client's model has three blocks. Each column denotes the blocks for each client. The blue square denotes the activated block. The pink square denotes the frozen block. The value of the block in the $m$-th row and $n$-th column corresponds to the aggregation weight, i.e., $\frac{a_{m,n}}{\sum_{n'\in\mathcal{N}}a_{m,n'}}$.
  • Figure 4: (a) Pareto front of $\Lambda$ versus the total VRAM usage ratio. (b) Comparison between SC1, SC2, and SC3 for the number of rounds to achieve the target average testing accuracy and the total VRAM usage.
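
The aggregation weight in the Figure 3 caption, $\frac{a_{m,n}}{\sum_{n'\in\mathcal{N}}a_{m,n'}}$, can be computed as in the sketch below. It assumes $a_{m,n}\in\{0,1\}$ indicates whether client $n$ activates block $m$. The activation matrix is a hypothetical example rather than the one drawn in the figure, and assigning zero weight to a block that no client activates is likewise an assumption.

```python
import numpy as np

def aggregation_weights(a):
    """Row-normalize a block-by-client activation matrix.

    a[m, n] = 1 if client n activates block m, else 0.
    Rows with no active client get all-zero weights (an assumption).
    """
    a = np.asarray(a, dtype=float)
    row_sums = a.sum(axis=1, keepdims=True)
    return np.divide(a, row_sums, out=np.zeros_like(a), where=row_sums > 0)

# Hypothetical decisions: three blocks (rows), three clients (columns).
a = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 0]]
weights = aggregation_weights(a)
```

Each row of `weights` sums to one over the clients that activate that block, so a block activated by two clients is averaged with weight 0.5 each, while a block activated by a single client takes that client's update directly.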
