Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly

Herbert Woisetschläger, Alexander Isenko, Shiqiang Wang, Ruben Mayer, Hans-Arno Jacobsen

TL;DR

The paper addresses enabling federated fine-tuning of large language models at the network edge under privacy and resource constraints. It adopts a hardware-centric methodology, evaluating FLAN-T5 models from $80\mathrm{M}$ to $3\mathrm{B}$ parameters on Jetson AGX Orin edge devices and comparing against an NVIDIA A100 data-center GPU, using LoRA for parameter-efficient fine-tuning. It introduces energy-efficiency metrics ($\eta_e = \frac{\mathrm{TPS}}{W}$) and Granularity $G = \frac{T_{\mathrm{comp}}}{T_{\mathrm{comm}}}$ to quantify edge FL performance, and compares four optimizers (FedAvg, FedAvgM, FedAdam, FedAdamW) with findings that FedAdamW improves convergence while communication remains a major energy sink, especially at the edge. The study reveals edge memory bandwidth bottlenecks, the strong role of PEFT in improving scalability, and the regulatory imperative for energy-aware FL, outlining concrete steps toward more practical edge-enabled foundation-model training.
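The two metrics named above can be illustrated with a minimal sketch. It assumes that TPS means tokens processed per second, that $W$ is the average power draw in watts, and that $T_{\mathrm{comp}}$ and $T_{\mathrm{comm}}$ are per-round computation and communication times in seconds. The function names and the input numbers are hypothetical and are not taken from the paper.

```python
def energy_efficiency(tokens_per_second: float, power_watts: float) -> float:
    """eta_e = TPS / W: throughput per watt of power draw (assumed units)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tokens_per_second / power_watts


def granularity(t_comp_seconds: float, t_comm_seconds: float) -> float:
    """G = T_comp / T_comm: computation time relative to communication time."""
    if t_comm_seconds <= 0:
        raise ValueError("t_comm_seconds must be positive")
    return t_comp_seconds / t_comm_seconds


# Hypothetical measurements for one client, chosen only for illustration.
eta_e = energy_efficiency(tokens_per_second=1200.0, power_watts=60.0)
g = granularity(t_comp_seconds=90.0, t_comm_seconds=30.0)

print(f"eta_e = {eta_e:.1f} tokens/s per W")
print(f"G = {g:.1f}")
```

Under these definitions, $G > 1$ means a round spends more time computing than communicating, and $G < 1$ means communication dominates.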

Abstract

Large Language Models (LLMs) and foundation models are popular as they offer new opportunities for individuals and businesses to improve natural language processing, interact with data, and retrieve information faster. However, training or fine-tuning LLMs requires a vast amount of data, which can be challenging to access due to legal or technical restrictions and may require private computing resources. Federated Learning (FL) is a solution designed to overcome these challenges and expand data access for deep learning applications. This paper takes a hardware-centric approach to explore how LLMs can be brought to modern edge computing systems. Our study fine-tunes the FLAN-T5 model family, ranging from 80M to 3B parameters, using FL for a text summarization task. We provide a micro-level hardware benchmark, compare the model FLOP utilization to a state-of-the-art data-center GPU, and study the network utilization in realistic conditions. Our contribution is twofold: First, we evaluate the current capabilities of edge computing systems and their potential for LLM FL workloads. Second, by comparing these systems with a data-center GPU, we demonstrate the potential for improvement and the next steps toward achieving greater computational efficiency at the edge.

Paper Structure

This paper contains 24 sections, 3 equations, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Development of computational power and resource availability of DL accelerators 2017 - 2023 for data centers and embedded systems. Key: RPi4 = Raspberry Pi 4, Nano = NVIDIA Jetson Nano, Orin Nano = NVIDIA Jetson Orin Nano, AGX Orin = NVIDIA Jetson AGX Orin 64 GB.
  • Figure 2: DL training step times across FLAN-T5 transformer models with varying minibatch sizes on the SAMSum dataset, running on the NVIDIA A100 and Jetson AGX Orin platforms. Detailed metrics are available in the appendix.
  • Figure 3: Our NVIDIA Jetson AGX Orin 64 GB testbed: 10 devices with a freely configurable network interconnect of up to 10 Gbit/s. Active external cooling is a must at the given energy density of $10 \cdot 60\,\mathrm{W}$ max. power draw.
  • Figure 4: We study the model FLOP utilization (MFU) and the energy efficiency ($\eta_e$) of the FLAN-T5 transformer model family and find a strong correlation between the MFU and $\eta_e$, which is useful for identifying the root causes of poor training speeds in real time.
  • Figure 5: We show the effectiveness of Federated AdamW by training the FLAN-T5 model family in a federated setup with 100 clients (10 clients per round). We report the validation loss (left) and the ROUGE-1 score (right) as performance indicators. Note: The loss spikes for FLAN-T5 Large originate from an increased sensitivity of LoRA adapters with a large parameter count to non-IID data (Babakniya et al., 2023).
  • ...and 1 more figures