Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge

Hai Wu; Xu Chen; Kaibin Huang

TL;DR

The paper tackles the challenge of fine-tuning large foundation models (FoMos) at the network edge under memory, compute, and wireless constraints. It introduces LoLa-DEFT, a depth-aware, multi-device cooperative fine-tuning framework, and develops two optimization tools: CRUNCH, for efficient depth-aware block-device matching, and a dual-ascent method for joint bandwidth-and-block allocation (JBBA). Key contributions are the CRUNCH algorithm, which exploits the monotone relationship between block depth and computation latency-and-memory cost; a dual-ascent JBBA framework that couples block assignment with bandwidth allocation; and experiments on fine-tuning RoBERTa on GLUE that show per-round latency reductions of up to ~40% and substantial on-device memory savings. The results indicate that privacy-preserving fine-tuning of large FoMos at the edge is practically feasible, enabling more responsive and personalized AI at the network edge.

Abstract

The emergence of large-scale foundation models (FoMos) that exhibit human-like intelligence motivates their deployment at the network edge so that devices can access state-of-the-art artificial intelligence. For better user experiences, the pre-trained FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques. To transcend a single device's memory and computation limitations, we advocate multi-device cooperation within the device-edge cooperative fine-tuning (DEFT) paradigm, where edge devices cooperate to simultaneously optimize different parts of the fine-tuning parameters within a FoMo. However, the parameter blocks reside at different depths within a FoMo architecture, leading to varying computation latency and memory costs due to gradient backpropagation-based calculations. The heterogeneous on-device computation and memory capacities and channel conditions necessitate an integrated communication-and-computation allocation of local computation loads and communication resources to achieve low-latency (LoLa) DEFT. To this end, we consider the depth-aware DEFT block allocation problem. The involved optimal block-device matching is tackled by the proposed low-complexity Cutting-RecoUNting-CHecking (CRUNCH) algorithm, which is designed by exploiting the monotone-increasing property between block depth and computation latency-and-memory cost. Next, joint bandwidth-and-block allocation further complicates the problem. By transforming and analyzing the original problem, with variables introduced to indicate device involvement, we observe a splittable Lagrangian expression. The dual ascent method is then employed to solve this problem iteratively. Extensive experiments on the GLUE benchmark demonstrate that LoLa-DEFT achieves significant latency reduction when fine-tuning a RoBERTa model.
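To make the role of the depth-cost monotonicity concrete, here is a minimal Python sketch of threshold-based feasibility checking for block-device matching. It is not the paper's CRUNCH algorithm, whose details are not given in this summary. It assumes a simplified setting in which every block must go to a distinct device and `latency[k][b]` (a hypothetical input) is non-decreasing in the block index `b`. Under that assumption, each device's feasible blocks for a threshold form a prefix, so a simple counting check replaces a general bipartite matching, and the smallest feasible threshold can be found by bisection. The function names `feasible` and `min_threshold` are illustrative only.

```python
from bisect import bisect_right
from typing import List, Optional


def feasible(latency: List[List[float]], t_th: float) -> bool:
    """Check whether every block can get its own device within latency t_th.

    Assumption: latency[k][b] is non-decreasing in b, so the blocks device k
    can finish within t_th are exactly the first caps[k] blocks.
    """
    num_blocks = len(latency[0])
    if len(latency) < num_blocks:
        return False
    # caps[k] = number of blocks device k can handle within the threshold.
    caps = sorted((bisect_right(row, t_th) for row in latency), reverse=True)
    # Counting check for nested feasible sets: the i-th most capable device
    # (0-indexed) must be able to reach block index num_blocks - 1 - i.
    return all(caps[i] >= num_blocks - i for i in range(num_blocks))


def min_threshold(latency: List[List[float]]) -> Optional[float]:
    """Bisect over the distinct latency values for the smallest feasible one."""
    candidates = sorted({x for row in latency for x in row})
    if not feasible(latency, candidates[-1]):
        return None
    lo, hi = 0, len(candidates) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(latency, candidates[mid]):
            hi = mid  # feasible: try a tighter threshold
        else:
            lo = mid + 1  # infeasible: the threshold must grow
    return candidates[lo]


# Three devices, two blocks; block 1 is the costlier one on every device.
example = [[1.0, 4.0], [2.0, 3.0], [5.0, 9.0]]
best = min_threshold(example)
```

In this toy instance, assigning block 0 to the first device and block 1 to the second gives a round latency of 3.0, which is the value the bisection returns. Memory limits could be folded into `caps` in the same way, since the abstract states that memory cost is also monotone in depth.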

Paper Structure (30 sections, 2 theorems, 27 equations, 8 figures, 1 table, 3 algorithms)

Introduction
Models and Operations
Fine-tuning Model
On-device Computation Model
On-device Gradient Backpropagation-based Fine-tuning
Memory-aware On-device Cooperative Computation
On-device Computation Latency
Communication Model
Model Downloading
Gradient Uploading
Global FoMo Updating
Problem Formulation
Depth-Aware Block Allocation for LoLa-DEFT
Optimal Block Allocation via Feasible Set Minimization
Depth-aware Low-complexity Matching Validation
...and 15 more sections

Key Result

Proposition 1

The problem $\mathrm{(P3)}$ has at least one feasible solution with the resulting latency being smaller than or equal to $T_{\sf th}$ if and only if

Figures (8)

Figure 1: The multi-device cooperative fine-tuning system and operations within one communication round.
Figure 2: On-device computation of parameter gradient in device $k$ based on the gradient backpropagation approach. All the activations, i.e., forward intermediate results and block gradients, are recorded in memory for chain rule-based gradient calculation. For multi-device DEFT, one activated device only needs to record the intermediate activations associated with the desired block and its gradient for server update. Once the desired parameter block is calculated, the backpropagation is terminated and the devices can fetch a new local iteration.
Figure 3: An example of the proposed CRUNCH algorithm.
Figure 4: (a) The average on-device memory load and computation latency of fine-tuning the parameter block in different depths of a RoBERTa base model toward the CoLA task. (b) The average memory cost of a device fine-tuning a RoBERTa base model considering single-device computing and multi-device cooperative computing scenarios.
Figure 5: The averaged round latency of multi-device fine-tuning of RoBERTa w.r.t. the different transmit SNR on the task of (a) CoLA and (b) MRPC, respectively.
...and 3 more figures

Theorems & Definitions (6)

Proposition 1: Feasibility Checking
proof
Example 1: Example for Executing CRUNCH
Lemma 1
proof
Remark 1: Bandwidth Compensation for LoLa-DEFT
