
Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge

Hai Wu, Xu Chen, Kaibin Huang

TL;DR

The paper tackles the challenge of fine-tuning large foundation models (FoMos) at the network edge under memory, compute, and wireless constraints. It introduces LoLa-DEFT, a depth-aware, multi-device cooperative fine-tuning framework, together with two optimization tools: CRUNCH, a low-complexity algorithm for depth-aware block-device matching that exploits the monotone relationship between block depth and computation latency-and-memory cost, and JBBA, a joint bandwidth-and-block allocation scheme solved by dual ascent. Experiments on fine-tuning RoBERTa on GLUE show per-round latency reductions of up to roughly 40% and substantial on-device memory savings. The results indicate that privacy-preserving fine-tuning of large FoMos at the edge is practically feasible, enabling more responsive and personalized AI at the network edge.

Abstract

The emergence of large-scale foundation models (FoMos) that exhibit human-like intelligence motivates their deployment at the network edge so that devices can access state-of-the-art artificial intelligence. For better user experiences, the pre-trained FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques. To transcend a single device's memory and computation limitations, we advocate multi-device cooperation within the device-edge cooperative fine-tuning (DEFT) paradigm, where edge devices cooperate to simultaneously optimize different parts of the fine-tuning parameters within a FoMo. However, the parameter blocks reside at different depths within a FoMo architecture, leading to varied computation latency-and-memory cost due to gradient backpropagation-based calculations. The heterogeneous on-device computation and memory capacities and channel conditions necessitate an integrated allocation of local computation loads and communication resources to achieve low-latency (LoLa) DEFT. To this end, we consider the depth-aware DEFT block allocation problem. The optimal block-device matching involved is tackled by the proposed low-complexity Cutting-RecoUNting-CHecking (CRUNCH) algorithm, which is designed by exploiting the monotone-increasing relationship between block depth and computation latency-and-memory cost. Next, joint bandwidth-and-block allocation complicates the problem. Through transformation and analysis of the original problem, in which variables indicating device involvement are introduced, we observe a splittable Lagrangian expression. The dual ascent method is then employed to tackle this problem iteratively. Through extensive experiments conducted on the GLUE benchmark, our results demonstrate the significant latency reduction achievable by LoLa DEFT for fine-tuning a RoBERTa model.
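The role of the monotone depth-cost property in block-device matching can be illustrated with a small sketch. This is not the paper's CRUNCH procedure or its Proposition 1; it is a generic counting test under simplifying assumptions that are ours: each device takes at most one block, the latency of a device on a block is `cost / speed`, and blocks are listed in order of increasing compute cost and memory cost. All function and variable names are hypothetical.

```python
def reach(costs, mems, speed, mem_cap, t_th):
    """Count the leading blocks one device can finish within t_th and fit in memory.

    costs and mems are sorted ascending (the monotonicity assumption), so the
    blocks a device can handle always form a prefix of the list.
    """
    count = 0
    for cost, mem in zip(costs, mems):
        if cost / speed > t_th or mem > mem_cap:
            break
        count += 1
    return count


def feasible(costs, mems, speeds, mem_caps, t_th):
    """True if every block can go to a distinct device with latency <= t_th."""
    n_blocks = len(costs)
    reaches = sorted(
        (reach(costs, mems, f, cap, t_th) for f, cap in zip(speeds, mem_caps)),
        reverse=True,
    )
    if len(reaches) < n_blocks:
        return False
    # Pair the i-th most capable device with the i-th costliest block.
    return all(reaches[i] >= n_blocks - i for i in range(n_blocks))


def min_round_latency(costs, mems, speeds, mem_caps):
    """Smallest threshold passing the test, or None if no matching exists."""
    # The optimum is always one of the pairwise latencies cost / speed.
    candidates = sorted({c / f for c in costs for f in speeds})
    if not candidates or not feasible(costs, mems, speeds, mem_caps, candidates[-1]):
        return None
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(costs, mems, speeds, mem_caps, candidates[mid]):
            hi = mid
        else:
            lo = mid + 1
    return candidates[lo]


# Toy data (made up): 3 blocks, 4 devices; the fastest device lacks memory
# for the costliest block.
print(min_round_latency([1.0, 2.0, 4.0], [1, 2, 3], [1.0, 2.0, 4.0, 0.5], [3, 3, 2, 3]))
```

Because each device's feasible blocks form a prefix, a single sort and count replaces a general bipartite matching, and feasibility is monotone in the threshold, which is what makes the binary search valid. For the toy data the smallest feasible threshold is 2.0.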

Paper Structure (30 sections, 2 theorems, 27 equations, 8 figures, 1 table, 3 algorithms)


Key Result

Proposition 1

The problem $\mathrm{(P3)}$ has at least one feasible solution with the resulting latency being smaller than or equal to $T_{\sf th}$ if and only if

Figures (8)

  • Figure 1: The multi-device cooperative fine-tuning system and operations within one communication round.
  • Figure 2: On-device computation of the parameter gradient in device $k$ based on gradient backpropagation. All the activations, i.e., forward intermediate results and block gradients, are recorded in memory for chain rule-based gradient calculation. For multi-device DEFT, an active device only needs to record the intermediate activations associated with its desired block and that block's gradient for the server update. Once the gradient of the desired parameter block is computed, backpropagation is terminated and the device can start a new local iteration.
  • Figure 3: An example of the proposed CRUNCH algorithm.
  • Figure 4: (a) The average on-device memory load and computation latency of fine-tuning parameter blocks at different depths of a RoBERTa base model on the CoLA task. (b) The average memory cost of a device fine-tuning a RoBERTa base model under single-device computing and multi-device cooperative computing.
  • Figure 5: The average round latency of multi-device fine-tuning of RoBERTa versus transmit SNR on (a) CoLA and (b) MRPC.
  • ...and 3 more figures
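
The early-terminated backpropagation described in the Figure 2 caption can be sketched on a toy model. The chain of scalar blocks `h_i = tanh(w_i * h_{i-1})`, the squared-error loss, and every name below are assumptions made for illustration; the paper's model is a RoBERTa transformer, not this chain.

```python
import math


def loss(weights, x, y):
    """Squared-error loss of a chain of scalar blocks h_i = tanh(w_i * h_{i-1})."""
    h = x
    for w in weights:
        h = math.tanh(w * h)
    return 0.5 * (h - y) ** 2


def block_gradient(weights, x, y, target):
    """Gradient of the loss w.r.t. weights[target] with early-stopped backprop.

    Returns (gradient, backward_steps, stored_activation_pairs).
    """
    h = x
    saved = []  # (input, output) pairs, kept only for blocks target..last
    for i, w in enumerate(weights):
        out = math.tanh(w * h)
        if i >= target:
            saved.append((h, out))
        h = out
    grad_out = h - y  # derivative of the loss w.r.t. the final output
    steps = 0
    for i in range(len(weights) - 1, target - 1, -1):
        inp, out = saved[i - target]
        grad_pre = grad_out * (1.0 - out * out)  # back through tanh
        steps += 1
        if i == target:
            # Desired block reached: emit its gradient and stop backpropagating.
            return grad_pre * inp, steps, len(saved)
        grad_out = grad_pre * weights[i]  # pass the gradient to the previous block


weights = [0.5, -1.2, 0.8, 1.5]
for target in range(len(weights)):
    grad, steps, stored = block_gradient(weights, x=0.7, y=0.2, target=target)
    print(target, steps, stored)
```

In this toy chain, a block closer to the input needs more backward steps and more stored activations than one closer to the output, which is the kind of depth-dependent cost the block allocation has to account for.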

Theorems & Definitions (6)

  • Proposition 1: Feasibility Checking
  • proof
  • Example 1: Example for Executing CRUNCH
  • Lemma 1
  • proof
  • Remark 1: Bandwidth Compensation for LoLa-DEFT
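
The dual ascent method named in the abstract can be illustrated on a toy bandwidth-splitting problem. The paper's joint bandwidth-and-block allocation problem is not reproduced here; the objective below (total upload time `sum(a_k / b_k)` under a bandwidth budget `sum(b_k) = B`), the step size, and all names are assumptions chosen so that the inner minimization has a closed form.

```python
import math


def dual_ascent_bandwidth(loads, total_bw, step=0.5, iters=200, lam=1.0):
    """Minimize sum(a_k / b_k) subject to sum(b_k) = total_bw by dual ascent.

    Lagrangian: sum(a_k / b_k) + lam * (sum(b_k) - total_bw).
    For fixed lam it splits per device, with minimizer b_k = sqrt(a_k / lam).
    """
    for _ in range(iters):
        # Primal step: each device's bandwidth is found independently.
        bw = [math.sqrt(a / lam) for a in loads]
        # Dual step: raise the price if the budget is exceeded, lower it otherwise.
        lam = max(lam + step * (sum(bw) - total_bw), 1e-12)
    bw = [math.sqrt(a / lam) for a in loads]
    return bw, lam


bw, lam = dual_ascent_bandwidth([1.0, 4.0, 9.0], total_bw=3.0)
print([round(b, 6) for b in bw], round(lam, 6))
```

For this toy instance the closed-form optimum is `lam = (sum(sqrt(a_k)) / B) ** 2 = 4`, giving bandwidths 0.5, 1.0 and 1.5, which the iteration approaches. The fixed step size is tuned to this instance and would need adjusting for other data.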